
How To: Use the psych package for Factor Analysis and data reduction

William Revelle
Department of Psychology
Northwestern University

April 24, 2017

Contents

1 Overview of this and related documents
 1.1 Jump starting the psych package – a guide for the impatient
2 Overview of this and related documents
3 Getting started
4 Basic data analysis
 4.1 Getting the data by using read.file
 4.2 Data input from the clipboard
 4.3 Basic descriptive statistics
 4.4 Simple descriptive graphics
  4.4.1 Scatter Plot Matrices
  4.4.2 Correlational structure
  4.4.3 Heatmap displays of correlational structure
 4.5 Polychoric, tetrachoric, polyserial, and biserial correlations
5 Item and scale analysis
 5.1 Dimension reduction through factor analysis and cluster analysis
  5.1.1 Minimum Residual Factor Analysis
  5.1.2 Principal Axis Factor Analysis
  5.1.3 Weighted Least Squares Factor Analysis
  5.1.4 Principal Components analysis (PCA)
  5.1.5 Hierarchical and bi-factor solutions
  5.1.6 Item Cluster Analysis: iclust
 5.2 Confidence intervals using bootstrapping techniques
 5.3 Comparing factor/component/cluster solutions
 5.4 Determining the number of dimensions to extract
  5.4.1 Very Simple Structure
  5.4.2 Parallel Analysis
 5.5 Factor extension
6 Classical Test Theory and Reliability
 6.1 Reliability of a single scale
 6.2 Using omega to find the reliability of a single scale
 6.3 Estimating ωh using Confirmatory Factor Analysis
  6.3.1 Other estimates of reliability
 6.4 Reliability and correlations of multiple scales within an inventory
  6.4.1 Scoring from raw data
  6.4.2 Forming scales from a correlation matrix
 6.5 Scoring Multiple Choice Items
 6.6 Item analysis
7 Item Response Theory analysis
 7.1 Factor analysis and Item Response Theory
 7.2 Speeding up analyses
 7.3 IRT based scoring
8 Multilevel modeling
 8.1 Decomposing data into within and between level correlations using statsBy
 8.2 Generating and displaying multilevel data
9 Set Correlation and Multiple Regression from the correlation matrix
10 Simulation functions
11 Graphical Displays
12 Miscellaneous functions
13 Data sets
14 Development version and a users guide
15 Psychometric Theory
16 SessionInfo

1 Overview of this and related documents

To do basic and advanced personality and psychological research using R is not as complicated as some think. This is one of a set of "How To" documents on doing various things using R (R Core Team, 2017), particularly using the psych (Revelle, 2017) package.

The current list of How To's includes:

1 Installing R and some useful packages

2 Using R and the psych package to find ωh and ωt

3 Using R and the psych package for factor analysis and principal components analysis (this document)

4 Using the scoreItems function to find scale scores and scale statistics

5 An overview (vignette) of the psych package

1.1 Jump starting the psych package – a guide for the impatient

You have installed psych (section 3) and you want to use it without reading much more. What should you do?

1 Activate the psych package:

R code

library(psych)

2 Input your data (section 4.2). If your file name ends in .sav, .text, .txt, .csv, .xpt, .rds, .Rds, .rda, or .RDATA, then just read it in directly using read.file. Or you can go to your friendly text editor or data manipulation program (e.g., Excel) and copy the data to the clipboard. Include a first line that has the variable labels. Paste it into psych using the read.clipboard.tab command:

R code
myData <- read.file()   # this will open a search window on your machine and then read or load the file
# or, first copy your file to your clipboard and then
myData <- read.clipboard.tab()   # if you have an Excel file

3 Make sure that what you just read is right. Describe it (section 4.3) and perhaps look at the first and last few lines. If you want to "view" the first and last few lines using a spreadsheet-like viewer, use quickView.

R code
describe(myData)
headTail(myData)
# or
quickView(myData)

4 Look at the patterns in the data. If you have fewer than about 10 variables, look at the SPLOM (Scatter Plot Matrix) of the data using pairs.panels (section 4.4.1).

R code
pairs.panels(myData)

5 Find the correlations of all of your data:

• Descriptively (just the values) (section 4.4.2):

R code
r <- lowerCor(myData)

• Graphically (section 4.4.3):

R code

corPlot(r)

6 Test for the number of factors in your data using parallel analysis (fa.parallel, section 5.4.2) or Very Simple Structure (vss, section 5.4.1).

R code
fa.parallel(myData)
vss(myData)

7 Factor analyze (see section 5.1) the data with a specified number of factors (the default is 1). The default method is minimum residual; the default rotation for more than one factor is oblimin. There are many more possibilities (see sections 5.1.1-5.1.3). Compare the solution to a hierarchical cluster analysis using the ICLUST algorithm (Revelle, 1979) (see section 5.1.6). Also consider a hierarchical factor solution to find coefficient ω (see 5.1.5).

R code
fa(myData)
iclust(myData)
omega(myData)

8 Some people like to find coefficient α as an estimate of reliability. This may be done for a single scale using the alpha function (see 6.1). Perhaps more useful is the ability to create several scales as unweighted averages of specified items using the scoreItems function (see 6.4) and to find various estimates of internal consistency for these scales, find their intercorrelations, and find scores for all the subjects.

R code
alpha(myData)   # score all of the items as part of one scale
myKeys <- make.keys(nvar=20, list(first = c(1,-3,5,-7,8,10), second = c(2,4,-6,11,15,-16)))
my.scores <- scoreItems(myKeys, myData)   # form several scales
my.scores   # show the highlights of the results

At this point you have had a chance to see the highlights of the psych package and to do some basic (and advanced) data analysis. You might find reading the entire overview vignette helpful to get a broader understanding of what can be done in R using psych. Remember that the help command (?) is available for every function. Try running the examples for each help page.

2 Overview of this and related documents

The psych package (Revelle, 2017) has been developed at Northwestern University since 2005 to include functions most useful for personality, psychometric, and psychological research. The package is also meant to supplement a text on psychometric theory (Revelle, prep), a draft of which is available at http://personality-project.org/r/book.

Some of the functions (e.g., read.file, read.clipboard, describe, pairs.panels, scatter.hist, error.bars, multi.hist, bi.bars) are useful for basic data entry and descriptive analyses.

Psychometric applications emphasize techniques for dimension reduction, including factor analysis, cluster analysis, and principal components analysis. The fa function includes five methods of factor analysis (minimum residual, principal axis, weighted least squares, generalized least squares, and maximum likelihood factor analysis). Determining the number of factors or components to extract may be done by using the Very Simple Structure (Revelle and Rocklin, 1979) (vss), Minimum Average Partial correlation (Velicer, 1976) (MAP), or parallel analysis (fa.parallel) criteria. Item Response Theory (IRT) models for dichotomous or polytomous items may be found by factoring tetrachoric or polychoric correlation matrices and expressing the resulting parameters in terms of location and discrimination using irt.fa. Bifactor and hierarchical factor structures may be estimated by using Schmid-Leiman transformations (Schmid and Leiman, 1957) (schmid) to transform a hierarchical factor structure into a bifactor solution (Holzinger and Swineford, 1937). Scale construction can be done using the Item Cluster Analysis (Revelle, 1979) (iclust) function to determine the structure and to calculate reliability coefficients α (Cronbach, 1951) (alpha, scoreItems, score.multiple.choice), β (Revelle, 1979; Revelle and Zinbarg, 2009) (iclust), and McDonald's ωh and ωt (McDonald, 1999) (omega). Guttman's six estimates of internal consistency reliability (Guttman, 1945), as well as additional estimates (Revelle and Zinbarg, 2009), are in the guttman function. The six measures of Intraclass correlation coefficients (ICC) discussed by Shrout and Fleiss (1979) are also available.

Graphical displays include Scatter Plot Matrix (SPLOM) plots using pairs.panels, correlation "heat maps" (corPlot), factor, cluster, and structural diagrams using fa.diagram, iclust.diagram, and structure.diagram, as well as item response characteristics and item and test information characteristic curves using plot.irt and plot.poly.

3 Getting started

Some of the functions described in this overview require other packages. Particularly useful for rotating the results of factor analyses (from e.g., fa or principal) or hierarchical factor models using omega or schmid is the GPArotation package. These and other useful packages may be installed by first installing and then using the task views (ctv) package to install the "Psychometrics" task view, but doing it this way is not necessary.
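For example, a minimal sketch of that installation route (a one-time step; install.views pulls in the many packages listed in the task view):

R code
install.packages("ctv")          # the task view installer
library(ctv)
install.views("Psychometrics")   # install the packages listed in the Psychometrics task view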

4 Basic data analysis

A number of psych functions facilitate the entry of data and finding basic descriptive statistics.

Remember, to run any of the psych functions it is necessary to make the package active by using the library command:

R code
library(psych)

The other packages, once installed, will be called automatically by psych.

It is possible to automatically load psych and other functions by creating and then saving a ".First" function, e.g.,

R code
.First <- function(x) {library(psych)}

4.1 Getting the data by using read.file

Although many find copying the data to the clipboard and then using the read.clipboard functions (see below) helpful, an alternative is to read the data in directly. This can be done using the read.file function, which calls file.choose to find the file and then, based upon the suffix of the file, chooses the appropriate way to read it. For files with suffixes of .txt, .text, .r, .rds, .rda, .csv, .xpt, or .sav, the file will be read correctly.

R code
my.data <- read.file()

If the file contains Fixed Width Format (fwf) data, the column information can be specified with the widths command.

R code
my.data <- read.file(widths = c(4, rep(1,35)))   # will read a file without a header row and 36 fields, the first of which is 4 columns wide, the rest of which are 1 column each

If the file is a RData file (with suffix of .RData, .Rda, .rda, .Rdata, or .rdata), the object will be loaded. Depending on what was stored, this might be several objects. If the file is a .sav file from SPSS, it will be read with the most useful default options (converting the file to a data.frame and converting character fields to numeric). Alternative options may be specified. If it is an export file from SAS (.xpt or .XPT), it will be read. .csv files (comma separated files), normal .txt or .text files, and .data or .dat files will be read as well. These are assumed to have a header row of variable labels (header=TRUE). If the data do not have a header row, you must specify read.file(header=FALSE).

To read SPSS files and keep the value labels, specify use.value.labels=TRUE. Unfortunately, this will make it difficult to do any analyses, and you might want to read the file again, converting the value labels to numeric values.

R code
my.spss <- read.file(use.value.labels=TRUE)   # this will keep the value labels for .sav files
my.data <- read.file()   # just read the file and convert the value labels to numeric

You can quickView the my.spss object to see what it looks like, as well as the my.data object to see the numeric encoding.

R code
quickView(my.spss)   # to see the original encoding
quickView(my.data)   # to see the numeric encoding

4.2 Data input from the clipboard

There are, of course, many ways to enter data into R. Reading from a local file using read.file is perhaps the most preferred. However, many users will enter their data in a text editor or spreadsheet program and then want to copy and paste it into R. This may be done by using read.table and specifying the input file as "clipboard" (PCs) or "pipe(pbpaste)" (Macs). Alternatively, the read.clipboard set of functions is perhaps more user friendly:

read.clipboard is the base function for reading data from the clipboard.

read.clipboard.csv for reading text that is comma delimited.

read.clipboard.tab for reading text that is tab delimited (e.g., copied directly from an Excel file).

read.clipboard.lower for reading input of a lower triangular matrix with or without a diagonal. The resulting object is a square matrix.

read.clipboard.upper for reading input of an upper triangular matrix.

read.clipboard.fwf for reading in fixed width fields (some very old data sets).

For example, given a data set copied to the clipboard from a spreadsheet, just enter the command:

R code
my.data <- read.clipboard()

This will work if every data field has a value and even missing data are given some values (e.g., NA or -999). If the data were entered in a spreadsheet and the missing values were just empty cells, then the data should be read in as tab delimited or by using the read.clipboard.tab function.

R code
my.data <- read.clipboard(sep="\t")   # define the tab option, or
my.tab.data <- read.clipboard.tab()   # just use the alternative function

For the case of data in fixed width fields (some old data sets tend to have this format), copy to the clipboard and then specify the width of each field (in the example below, the first variable is 5 columns, the second is 2 columns, the next 5 are 1 column each, and the last 4 are 3 columns each).

R code
my.data <- read.clipboard.fwf(widths=c(5,2,rep(1,5),rep(3,4)))

4.3 Basic descriptive statistics

Once the data are read in, then describe will provide basic descriptive statistics arranged in a data frame format. Consider the data set sat.act, which includes data from 700 web based participants on 3 demographic variables and 3 ability measures.

describe reports means, standard deviations, medians, min, max, range, skew, kurtosis, and standard errors for integer or real data. Non-numeric data, although the statistics are meaningless, will be treated as if numeric (based upon the categorical coding of the data) and will be flagged with an *.

It is very important to describe your data before you continue on doing more complicated multivariate statistics. The problem of outliers and bad data can not be overemphasized.
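One quick screening device, as a sketch: the outlier function plots the Mahalanobis distance of each case against the corresponding χ2 quantiles, which helps flag bad data before any multivariate analysis.

R code
d2 <- outlier(sat.act)   # Mahalanobis D2 for each case, plotted against chi-square quantiles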

> library(psych)
> data(sat.act)
> describe(sat.act)   # basic descriptive statistics

          vars   n   mean     sd median trimmed    mad min max range  skew kurtosis   se
gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61    -1.62 0.02
education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68    -0.07 0.05
age          3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64     2.42 0.36
ACT          4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66     0.53 0.18
SATV         5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64     0.33 4.27
SATQ         6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59    -0.02 4.41

4.4 Simple descriptive graphics

Graphic descriptions of data are very helpful both for understanding the data as well as for communicating important results. Scatter Plot Matrices (SPLOMs) using the pairs.panels function are useful ways to look for strange effects involving outliers and non-linearities. error.bars.by will show group means with 95% confidence boundaries.
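As a one-line sketch using the sat.act data described above:

R code
error.bars.by(sat.act[4:6], group=sat.act$gender)   # ACT, SATV, and SATQ means with confidence bounds, by gender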

4.4.1 Scatter Plot Matrices

Scatter Plot Matrices (SPLOMs) are very useful for describing the data. The pairs.panels function, adapted from the help menu for the pairs function, produces xy scatter plots of each pair of variables below the diagonal, shows the histogram of each variable on the diagonal, and shows the lowess locally fit regression line as well. An ellipse around the mean, with the axis length reflecting one standard deviation of the x and y variables, is also drawn. The x axis in each scatter plot represents the column variable, the y axis the row variable (Figure 1). When plotting many subjects, it is both faster and cleaner to set the plot character (pch) to be '.' (See Figure 1 for an example.)

pairs.panels will show the pairwise scatter plots of all the variables as well as histograms, locally smoothed regressions, and the Pearson correlation. When plotting many data points (as in the case of the sat.act data), it is possible to specify that the plot character is a period to get a somewhat cleaner graphic.

4.4.2 Correlational structure

There are many ways to display correlations. Tabular displays are probably the most common. The output from the cor function in core R is a rectangular matrix. lowerMat will round this to (2) digits and then display it as a lower off diagonal matrix. lowerCor calls cor with use='pairwise', method='pearson' as default values and returns (invisibly) the full correlation matrix while displaying the lower off diagonal matrix.

> lowerCor(sat.act)

          gendr edctn age   ACT   SATV  SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64  1.00

When comparing results from two different groups, it is convenient to display them as one matrix, with the results from one group below the diagonal and the other group above the diagonal. Use lowerUpper to do this:

> female <- subset(sat.act, sat.act$gender==2)
> male <- subset(sat.act, sat.act$gender==1)
> lower <- lowerCor(male[-1])

> png('pairspanels.png')
> pairs.panels(sat.act, pch='.')
> dev.off()
null device
          1

Figure 1: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. The plot character was set to a period (pch='.') in order to make a cleaner graph.

          edctn age   ACT   SATV  SATQ
education  1.00
age        0.61  1.00
ACT        0.16  0.15  1.00
SATV       0.02 -0.06  0.61  1.00
SATQ       0.08  0.04  0.60  0.68  1.00

> upper <- lowerCor(female[-1])

          edctn age   ACT   SATV  SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08  1.00
SATV       0.07 -0.03  0.53  1.00
SATQ       0.03 -0.09  0.58  0.63  1.00

> both <- lowerUpper(lower, upper)
> round(both, 2)

          education   age  ACT  SATV  SATQ
education        NA  0.52 0.16  0.07  0.03
age            0.61    NA 0.08 -0.03 -0.09
ACT            0.16  0.15   NA  0.53  0.58
SATV           0.02 -0.06 0.61    NA  0.63
SATQ           0.08  0.04 0.60  0.68    NA

It is also possible to compare two matrices by taking their differences and displaying one (below the diagonal) and the difference of the second from the first above the diagonal.

> diffs <- lowerUpper(lower, upper, diff=TRUE)
> round(diffs, 2)

          education   age  ACT  SATV SATQ
education        NA  0.09 0.00 -0.05 0.05
age            0.61    NA 0.07 -0.03 0.13
ACT            0.16  0.15   NA  0.08 0.02
SATV           0.02 -0.06 0.61    NA 0.05
SATQ           0.08  0.04 0.60  0.68   NA

4.4.3 Heatmap displays of correlational structure

Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set, which has a clear 3 factor solution (Figure 2), or a simulated data set of 24 variables with a circumplex structure (Figure 3). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.

> png('corplot.png')
> corPlot(Thurstone, numbers=TRUE, main="9 cognitive variables from Thurstone")
> dev.off()
null device
          1

Figure 2: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option, the values are displayed as well.

> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> corPlot(r.circ, main="24 variables in a circumplex")
> dev.off()
null device
          1

Figure 3: Using the corPlot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect.

4.5 Polychoric, tetrachoric, polyserial, and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, φ will underestimate the value of the Pearson correlation applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process. A simple generalization of this to the case of multiple cuts is the polychoric correlation.

Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlations.

If the data are a mix of continuous, polytomous, and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.

A correlation matrix formed from a number of tetrachoric or polychoric correlations will sometimes not be positive semi-definite. This will also happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigen values of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
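A minimal sketch of that workflow, using the lsat6 ability items included with psych (the smoothing step changes nothing if the matrix is already positive semi-definite):

R code
r.tet <- tetrachoric(lsat6)$rho   # tetrachoric correlations of five dichotomous items
r.ok <- cor.smooth(r.tet)         # adjust the eigen values if the matrix is improper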

5 Item and scale analysis

The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.

5.1 Dimension reduction through factor analysis and cluster analysis

Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity1. The goal for parsimony is seen in psychometrics as an attempt either to describe (components) or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

1 Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.

The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs2. At the most simple, a set of items can be thought to represent a random sample from one underlying domain or perhaps a small set of domains. The question for the psychometrician is how many domains are represented and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.

At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores or to estimate (for factors) factor scores.
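As a sketch of this equivalence, assuming the sat.act ability measures (listwise deletion handles the missing SATQ values):

R code
X <- scale(na.omit(sat.act[4:6]))   # standardize ACT, SATV, and SATQ
sv <- svd(X)                        # X = U D V'
pc <- X %*% sv$v                    # principal component scores
round(cor(pc), 2)                   # the components are uncorrelated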

Several of the functions in psych address the problem of data reduction:

fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis, and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will eventually be phased out.

principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values.

factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the sum of the squared loadings. It differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.

vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ2 estimate.

2 Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.

fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.

fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

nfactors A number of different tests for the number of factors problem are run.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm, specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.

5.1.1 Minimum Residual Factor Analysis

The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix nRn be approximated by the product of a factor matrix nFk and its transpose plus a diagonal matrix of uniquenesses?

R = FF' + U^2    (1)

The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls", and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.

factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.

A classic data set collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999) is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 1). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ", and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation, although prior versions defaulted to varimax. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).
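The calls differ only in the rotate argument; as a brief sketch (the first call reproduces Table 1):

R code
f3t <- fa(Thurstone, nfactors=3, n.obs=213)                    # minres extraction, oblimin rotation (the defaults)
f3v <- fa(Thurstone, nfactors=3, n.obs=213, rotate="varimax")  # an orthogonal alternative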

5.1.2 Principal Axis Factor Analysis

An alternative least squares algorithm, factor.pa (incorporated into fa as an option (fm="pa")), does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the maxiter parameter to 1 produces the SAS solution.
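As a sketch (in recent versions of psych the argument is spelled max.iter):

R code
f1pa <- fa(Thurstone, nfactors=3, n.obs=213, fm="pa", max.iter=1)   # one iteration approximates the SAS PA solution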

The solutions from the fa, factor.minres, and factor.pa, as well as the principal functions, can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations are varimax and quartimax. Oblique transformations include oblimin, quartimin, and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.

Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 2).

5.1.3 Weighted Least Squares Factor Analysis

Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller

Table 1: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.

> f3t <- fa(Thurstone, 3, n.obs=213)
> f3t
Factor Analysis using method = minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix

                 MR1   MR2   MR3   h2   u2 com
Sentences       0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary      0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion 0.83  0.04  0.00 0.73 0.27 1.0
First.Letters   0.00  0.86  0.00 0.73 0.27 1.0
4.Letter.Words -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes        0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series   0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees       0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group   -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63

Table 2: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.

> f3 <- fa(Thurstone, 3, n.obs = 213, fm="pa")
> f3o <- target.rot(f3)
> f3o
Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix

                 PA1   PA2   PA3   h2   u2
Sentences       0.89 -0.03  0.07 0.81 0.19
Vocabulary      0.89  0.07  0.00 0.80 0.20
Sent.Completion 0.83  0.04  0.03 0.70 0.30
First.Letters  -0.02  0.85 -0.01 0.73 0.27
4.Letter.Words -0.05  0.74  0.09 0.57 0.43
Suffixes        0.17  0.63 -0.09 0.43 0.57
Letter.Series  -0.06 -0.08  0.84 0.69 0.31
Pedigrees       0.33 -0.09  0.48 0.37 0.63
Letter.Group   -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00

overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 3).

The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 4). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 4) or by a structure diagram (Figure 5).

> plot(f3t)

[SPLOM of the factor loadings on MR1, MR2, and MR3, titled "Factor Analysis"]

Figure 4: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.

A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but

Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.

> f3w <- fa(Thurstone, 3, n.obs = 213, fm="wls")
> print(f3w, cut=0, digits=3)
Factor Analysis using method = wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix

                  WLS1   WLS2   WLS3    h2    u2  com
Sentences        0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary       0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion  0.833  0.034  0.007 0.735 0.265 1.00
First.Letters   -0.002  0.855  0.003 0.731 0.269 1.00
4.Letter.Words  -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes         0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series    0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees        0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group    -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627

> fa.diagram(f3t)

[Path diagram titled "Factor Analysis": Sentences, Vocabulary, and Sent.Completion load on MR1 (0.9, 0.9, 0.8); First.Letters, 4.Letter.Words, and Suffixes on MR2 (0.9, 0.7, 0.6); Letter.Series, Letter.Group, and Pedigrees on MR3 (0.8, 0.6, 0.5); the factor intercorrelations are 0.6, 0.5, and 0.5]

Figure 5: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.

that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.

5.1.4 Principal Components analysis (PCA)

An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Psychologists typically use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 4 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. (Factor loadings and PCA loadings converge as the number of items increases; factor loadings may be thought of as the asymptotic principal component loadings as the number of items on the component grows towards infinity.) Also note how the goodness of fit statistics, based upon the residual off diagonal elements, are much worse than for the fa solution.
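The convergence claim can be checked directly; a sketch comparing the component and factor loadings with factor.congruence:

R code
p3 <- principal(Thurstone, 3, n.obs=213)   # components
f3 <- fa(Thurstone, 3, n.obs=213)          # factors
round(factor.congruence(f3, p3), 2)        # cosines between the factor and component loadings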

5.1.5 Hierarchical and bi-factor solutions

For a long time, structural analyses of the ability domain have considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).

Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 5) can themselves be factored. This results in a higher order factor model (Figure 6). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 7). Yet

Table 4: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 2. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than for the fa results.

> p3p <- principal(Thurstone, 3, n.obs = 213, rotate="Promax")
> p3p
Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix

                 RC1   RC2   RC3   h2   u2 com
Sentences       0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary      0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion 0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters   0.01  0.84  0.07 0.78 0.22 1.0
4.Letter.Words -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes        0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series   0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees       0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group   -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98

another solution is to use structural equation modeling to directly solve for the general andgroup factors

> om.h <- omega(Thurstone, n.obs=213, sl=FALSE)
> op <- par(mfrow = c(1,1))

[Hierarchical path diagram titled "Omega": the nine variables load on the group factors F1, F2, and F3, which in turn load on g (0.8, 0.8, 0.7)]

Figure 6: A higher order factor solution to the Thurstone 9 variable problem.

Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011). Unfortunately, this rotation will frequently achieve a local rather than global optimum, and it is recommended to try multiple starting configurations.

> om <- omega(Thurstone, n.obs=213)

[Bifactor path diagram titled "Omega": each variable loads on the general factor g (0.5 to 0.7) as well as on one of the residualized group factors F1, F2, or F3 (0.2 to 0.6)]

Figure 7: A bifactor solution to the Thurstone 9 variable problem.

5.1.6 Item Cluster Analysis: iclust

An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) review why the iclust algorithm is particularly appropriate for scale construction in marketing.

Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust, and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):

1 Find the proximity (e.g., correlation) matrix.

2 Identify the most similar pair of items.

3 Combine this most similar pair of items to form a new variable (cluster).

4 Find the similarity of this cluster to all other items and clusters.

5 Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains, or in iclust if there is a failure to increase reliability coefficients α or β).

6 Purify the solution by reassigning items to the most similar cluster center.

iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 8).

Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 7). The pattern matrix is available as an object in the results.

> data(bfi)
> ic <- iclust(bfi[1:25])

[ICLUST tree diagram of the 25 bfi items, showing the hierarchy of clusters with α and β for each; the final clusters are C20 (α = 0.81, β = 0.63), C16 (α = 0.81, β = 0.76), C15 (α = 0.73, β = 0.67), and C21 (α = 0.41, β = 0.27)]

Figure 8: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.

The previous analysis (Figure 8) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 9). Note that finding the polychoric correlations takes some time the first time, but the next three analyses were done using that correlation matrix (r.poly$rho). When using the console for input, polychoric will report on its progress while working, using

Table 5: The summary statistics from an iclust analysis show three large clusters and one smaller cluster.

> summary(ic)   # show the results
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6:
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations, reliabilities on diagonal,
correlations corrected for attenuation above diagonal:
      C20    C16   C15   C21
C20  0.80 -0.291  0.40 -0.33
C16 -0.24  0.815 -0.29  0.11
C15  0.30 -0.221  0.73 -0.30
C21 -0.23  0.074 -0.20  0.61

progressBar. polychoric and tetrachoric, when running on OS X, will take advantage of the multiple cores on the machine and run somewhat faster.

Table 6: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.

> data(bfi)
> r.poly <- polychoric(bfi[1:25])   # the ... indicate the progress of the function

A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.

5.2 Confidence intervals using bootstrapping techniques

Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ2 and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 8). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.

5.3 Comparing factor/component/cluster solutions

Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of the congruence coefficient, which is just the cosine of the angle between the dimensions:

c_{f_i f_j} = \frac{\sum_{k=1}^{n} f_{ik} f_{jk}}{\sqrt{\sum f_{ik}^2 \sum f_{jk}^2}}

Consider the case of a four factor solution and a four cluster solution to the Big Five problem.

> f4 <- fa(bfi[1:25], 4, fm="pa")
> factor.congruence(f4, ic)

      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12

> ic.poly <- iclust(r.poly$rho, title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)

[ICLUST tree diagram based on polychoric correlations; the root cluster C23 has α = 0.83, β = 0.58]

Figure 9: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 8), which was done using Pearson correlations.

> ic.poly <- iclust(r.poly$rho, 5, title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)

[ICLUST tree diagram with the solution restricted to 5 clusters; O4 does not join any cluster]

Figure 10: ICLUST of the BFI data set using polychoric correlations, with the solution set to 5 clusters. Compare this solution to the previous one (Figure 9), which was done without specifying the number of clusters, and to the next one (Figure 11), which was done by changing the beta criterion.

> ic.poly <- iclust(r.poly$rho, beta.size=3, title="ICLUST, beta.size=3")

[ICLUST tree diagram with beta.size set to 3; the structure is essentially that of Figure 9]

Figure 11: ICLUST of the BFI data set using polychoric correlations with the beta.size criterion set to 3. Compare this solution to the previous three (Figures 8, 9, and 10).

Table 7: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.

> print(ic, cut=.3)
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
     O   P    C20   C16   C15   C21
A1 C20 C20
A2 C20 C20  0.59
A3 C20 C20  0.65
A4 C20 C20  0.43
A5 C20 C20  0.65
C1 C15 C15              0.54
C2 C15 C15              0.62
C3 C15 C15              0.54
C4 C15 C15  0.31       -0.66
C5 C15 C15 -0.30  0.36 -0.59
E1 C20 C20 -0.50
E2 C20 C20 -0.61  0.34
E3 C20 C20  0.59             -0.39
E4 C20 C20  0.66
E5 C20 C20  0.50  0.40 -0.32
N1 C16 C16        0.76
N2 C16 C16        0.75
N3 C16 C16        0.74
N4 C16 C16 -0.34  0.62
N5 C16 C16        0.55
O1 C20 C21                   -0.53
O2 C21 C21                    0.44
O3 C20 C21  0.39             -0.62
O4 C21 C21                   -0.33
O5 C21 C21                    0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations, reliabilities on diagonal,
correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit = 0.68, Pattern fit = 0.96, RMSR = 0.05
NULL

Table 8: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.

> fa(bfi[1:10], 2, n.iter=20)
Factor Analysis with confidence intervals using method = fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method = minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix

     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with MLE Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.078 and the 90% confidence intervals are 0.071 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.46 -0.40 -0.36
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.06 -0.03  0.00  0.74  0.77  0.80
A4  0.10  0.15  0.18  0.41  0.44  0.48
A5 -0.02  0.02  0.07  0.58  0.62  0.66
C1  0.52  0.57  0.60 -0.08 -0.05 -0.02
C2  0.59  0.62  0.66 -0.04 -0.01  0.02
C3  0.50  0.54  0.57 -0.01  0.03  0.08
C4 -0.70 -0.66 -0.62 -0.02  0.01  0.03
C5 -0.62 -0.57 -0.53 -0.08 -0.05 -0.01

Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.28     0.32  0.37

PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90

A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor, and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 9).

Table 9 Congruence coefficients for oblique factor bifactor and component solutions forthe Thurstone problemgt factorcongruence(list(f3tf3oomp3p))

     MR1  MR2  MR3   PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2   RC3
MR1 1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08  0.04
MR2 0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99  0.12
MR3 0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02  0.99
PA1 1.00 0.03 0.01  1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2 0.06 1.00 0.01  0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99  0.05
PA3 0.13 0.06 1.00  0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01  0.99
g   0.72 0.60 0.52  0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58  0.50
F1  1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08  0.04
F2  0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99  0.12
F3  0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02  0.99
h2  0.74 0.57 0.51  0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56  0.49
RC1 1.00 0.04 0.06  1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06  0.00
RC2 0.08 0.99 0.02  0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00  0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05  1.00

5.4 Determining the number of dimensions to extract

How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course is to find the solution, or at least a solution that others will regard quite highly not as the best" Horn and Engstrom (1979).

Techniques most commonly used include the following (several are applied in the sketch after this list):

1) Extracting factors until the chi square of the residual matrix is not significant.

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis) fa.parallel (Horn, 1965).

4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).

5) Extracting factors as long as they are interpretable.

6) Using the Very Simple Structure Criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.
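Several of these criteria are implemented directly in psych. A minimal sketch applying them to the 25 bfi items (fa.parallel and vss are demonstrated later in this section; the nfactors function bundles several of the indices in one call):

> fa.parallel(bfi[1:25])   #criterion 3: parallel analysis (with the scree of criterion 4)
> vss(bfi[1:25])           #criteria 6 and 7: Very Simple Structure and Velicer's MAP
> nfactors(bfi[1:25])      #tries a range of factors and reports several of these indices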

Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples (N > 1,000) all of the eigen values of random factors will be very close to 1. The scree test is quite appealing but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigator's creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.

An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest but that there are also numerous minor factors of no substantive interest but that account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.

5.4.1 Very Simple Structure

The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the Minimum Average Partial (MAP) criterion of Velicer (1976).

Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 12). However, the MAP criterion suggests that 5 is optimal.


> vss <- vss(bfi[1:25],title="Very Simple Structure of a Big 5 inventory")

[Figure 12 here: plot of Very Simple Structure Fit (0.0 to 1.0) against Number of Factors (1 to 8) for item complexities 1 through 4, titled "Very Simple Structure of a Big 5 inventory"]

Figure 12: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis which suggests 6 and the MAP criterion which suggests 5.


> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximum of 0.58  with  4  factors
VSS complexity 2 achieves a maximum of 0.74  with  4  factors

The Velicer MAP achieves a minimum of 0.01  with  5  factors 
BIC achieves a minimum of  -524.26  with  8  factors
Sample Size adjusted BIC achieves a minimum of  -117.56  with  8  factors

Statistics by number of factors 
  vss1 vss2   map dof chisq     prob sqresid  fit RMSEA  BIC   SABIC complex eChisq  SRMR eCRMS  eBIC
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.123 9648 10522.1     1.0  23881 0.119 0.125 21698
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.100 5287  6084.5     1.2  12432 0.086 0.094 10440
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.087 3200  3924.3     1.3   7232 0.066 0.075  5422
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.074 1731  2385.1     1.4   3750 0.047 0.057  2115
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.055  281   869.3     1.6   1495 0.030 0.038    27
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.043 -296   228.5     1.7    670 0.020 0.027  -639
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.037 -463    12.0     1.9    448 0.016 0.023  -711
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.032 -524  -117.6     1.9    289 0.013 0.020  -727

5.4.2 Parallel Analysis

An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data, as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 13). It is interesting to compare fa.parallel with the paran function from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial, minor (but not zero) factors, show that using smcs will tend to lead to too many factors.

A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly or fa.parallel with the cor option="poly". By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
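A minimal sketch of the polychoric version, using the cor="poly" option just described (expect this to be much slower than the Pearson version):

> fa.parallel(bfi[1:25],cor="poly")   #parallel analysis of polychoric correlations, 20 replications by default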

5.5 Factor extension

Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created


> fa.parallel(bfi[1:25],main="Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors =  6  and the number of components =  6

[Figure 13 here: plot of eigenvalues of principal components and factor analysis against Factor/Component Number, titled "Parallel Analysis of a Big 5 inventory"; legend: PC Actual Data, PC Simulated Data, PC Resampled Data, FA Actual Data, FA Simulated Data, FA Resampled Data]

Figure 13: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.


to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 14).

Another way to examine the overlap between two sets is the use of set correlation found by set.cor (discussed later).

6 Classical Test Theory and Reliability

Surprisingly, 112 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).

α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

\[
\lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(V_x)}{V_x}\right) = \frac{n}{n-1}\,\frac{V_x - tr(V_x)}{V_x} = \alpha
\]
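Because λ3 depends only on the trace and the sum of the covariance (or correlation) matrix, it can be computed in one line. A minimal sketch for standardized items, using psych's tr function and the built-in Thurstone correlation matrix (any item correlation matrix would do):

> R <- Thurstone                           #a 9 x 9 correlation matrix supplied with psych
> n <- ncol(R)
> round((n/(n-1)) * (1 - tr(R)/sum(R)),2)  #lambda 3, i.e., standardized alpha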

Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. For τ equivalent items, alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy") coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.

Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, $e_j^2$, and is

\[
\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum (1 - r_{smc}^2)}{V_x}
\]

The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.
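The smcs themselves are easily inspected; a minimal sketch using psych's smc function on the same Thurstone correlation matrix:

> round(smc(Thurstone),2)   #squared multiple correlations: lower bounds for the item communalities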

G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability


> v16 <- sim.item(16)
> s <- c(1,3,5,7,9,11,13,15)
> f2 <- fa(v16[,s],2)
> fe <- fa.extension(cor(v16)[s,-s],f2)
> fa.diagram(f2,fe=fe)

[Figure 14 here: "Factor analysis and extension" diagram: factors MR1 and MR2, defined by the odd numbered variables V1 to V15 (left side), extended to the even numbered variables V2 to V16 (right side), with loadings of roughly ±0.5 to ±0.7]

Figure 14: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.


the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.

Alpha, G6 and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.

More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.

Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.

Functions for examining the reliability of a single scale or a set of scales include:

alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item - whole correlations, α if an item is omitted, and item means and standard deviations.

guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zegers (1978) (μ0 ... μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).

omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)

cluster.cor Given a n x c cluster definition matrix of -1s, 0s, and 1s (the keys), and a n x n correlation matrix, find the correlations of the composite clusters.

scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of items scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha coefficient, G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.

score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.

6.1 Reliability of a single scale

A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, McDonald's ωh and ωt. Consider a simulated data set, representing 9 items with a hierarchical structure, and the following correlation matrix. Then using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as if one item is dropped at a time.

> set.seed(17)
> r9 <- sim.hierarchical(n=500,raw=TRUE)$observed
> round(cor(r9),2)

     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00

> alpha(r9)

Reliability analysis   
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
  0.77   0.8  0.83 

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics 
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97

Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that item 2 is to be scored (incorrectly) negatively.

> keys <- c(1,-1,1,1,1,1,1)
> alpha(attitude,keys)

Reliability analysis   
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
       0.43      0.52    0.75      0.14 1.1 0.13   58 5.5

 lower alpha upper     95% confidence boundaries
  0.18  0.43  0.68 

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r  S/N alpha se
rating           0.32      0.44    0.67     0.114 0.77    0.170
complaints-      0.80      0.80    0.82     0.394 3.89    0.057
privileges       0.27      0.41    0.72     0.103 0.69    0.176
learning         0.14      0.31    0.64     0.069 0.44    0.207
raises           0.14      0.27    0.61     0.059 0.38    0.205
critical         0.36      0.47    0.76     0.130 0.90    0.139
advance          0.21      0.34    0.66     0.079 0.51    0.171

 Item statistics 
             n raw.r std.r r.cor r.drop mean   sd
rating      30  0.60  0.60  0.60   0.33   65 12.2
complaints- 30 -0.57 -0.58 -0.74  -0.74   50 13.3
privileges  30  0.66  0.65  0.54   0.42   53 12.2
learning    30  0.80  0.79  0.78   0.64   56 11.7
raises      30  0.82  0.83  0.85   0.69   65 10.4
critical    30  0.50  0.53  0.35   0.27   75  9.9
advance     30  0.74  0.75  0.71   0.58   43 10.3

Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if item 2 is dropped then the reliability is improved substantially. This suggests that item 2 was incorrectly scored. Doing the analysis again with item 2 positively scored produces much more favorable results.

> keys <- c(1,1,1,1,1,1,1)
> alpha(attitude,keys)


Reliability analysis   
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
  0.76  0.84  0.93 

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics 
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3

It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.

> items <- sim.congeneric(N=500,short=FALSE,low=-2,high=2,categorical=TRUE) #500 responses to 4 discrete items
> alpha(items$observed)  #item response analysis of congeneric measures

Reliability analysis   
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
  0.68  0.72  0.76 

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics 
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02


Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0

6.2 Using omega to find the reliability of a single scale

Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 15) or omegaSem for a confirmatory analysis using the sem package (Fox et al., 2013), based upon the exploratory solution from omega.

McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009) http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf ).

One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.
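A minimal sketch of those steps done by hand, using the r9 data from section 6.1 (the omega function automates all of this; "g" is the column name returned by schmid):

> R9 <- cor(r9)                  #the item correlation matrix
> sl <- schmid(R9,nfactors=3)    #oblique factoring followed by a Schmid-Leiman transformation
> g <- sl$sl[,"g"]               #the general factor loadings
> round(sum(g)^2/sum(R9),2)      #omega_h = 1cc'1'/Vx for standardized items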

ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal). For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively, and some negatively. Although probably better to specify which directions the items are to be scored by specifying a key vector, if flip = TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)

β, an alternative to ω, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).

The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).

Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.

There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be the $\sqrt{r_{ab}}$ where $r_{ab}$ is the oblique correlation between the factors), or to "first" or "second" in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. (2007). To do this in omega, add the option="first" or option="second" to the call.

Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.

In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.

McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness $u_j^2$ from factor analysis to find $e_j^2$. This is based on a decomposition of the variance of a test score, $V_x$, into four parts: that due to a general factor, g, that due to a set of group factors, f, (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)

Letting

\[
x = cg + Af + Ds + e
\]

then the communality of item j, based upon general as well as group factors, is $h_j^2 = c_j^2 + \sum f_{ij}^2$ and the unique variance for the item, $u_j^2 = \sigma_j^2(1 - h_j^2)$, may be used to estimate the test reliability. That is, if $h_j^2$ is the communality of item j, based upon general as well as group factors, then for standardized items, $e_j^2 = 1 - h_j^2$ and

\[
\omega_t = \frac{\mathbf{1}cc'\mathbf{1}' + \mathbf{1}AA'\mathbf{1}'}{V_x} = 1 - \frac{\sum(1 - h_j^2)}{V_x} = 1 - \frac{\sum u_j^2}{V_x}
\]


Because $h_j^2 \geq r_{smc}^2$, $\omega_t \geq \lambda_6$.

It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor:

\[
\omega_h = \frac{\mathbf{1}cc'\mathbf{1}'}{V_x}
\]

Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

\[
\omega_{inf} = \frac{\mathbf{1}cc'\mathbf{1}'}{\mathbf{1}cc'\mathbf{1}' + \mathbf{1}AA'\mathbf{1}'}
\]

In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.

> om9

9 simulated variables 
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8 
G.6:                   0.8 
Omega Hierarchical:    0.69 
Omega H asymptotic:    0.82 
Omega Total            0.83 

Schmid Leiman Factor loadings greater than  0.2 
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3* 
2.48 0.52 0.47 0.36 

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13 
Explained Common Variance of the general factor =  0.65 

The degrees of freedom are 12  and the fit is  0.03 
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02 
The df corrected root mean square of the residuals is  0.03


> om9 <- omega(r9,title="9 simulated variables")

[Figure 15 here: "9 simulated variables" bifactor diagram: a general factor g (loadings 0.3 to 0.7) and three group factors F1, F2, F3 (loadings 0.2 to 0.5) for V1 to V9]

Figure 15: A bifactor solution for 9 simulated variables with a hierarchical structure.


RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32 
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08 
The df corrected root mean square of the residuals is  0.09 

RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97 

Measures of factor score adequacy 
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

 Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

6.3 Estimating ωh using Confirmatory Factor Analysis

The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.

> omegaSem(r9,n.obs=500)

Call: omegaSem(m = r9, n.obs = 500)
Omega 
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip, 
    digits = digits, title = title, sl = sl, labels = labels, 
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)

Alpha:                 0.8 
G.6:                   0.8 
Omega Hierarchical:    0.69 
Omega H asymptotic:    0.82 
Omega Total            0.83 

Schmid Leiman Factor loadings greater than  0.2 
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63


With eigenvalues of:
   g  F1*  F2*  F3* 
2.48 0.52 0.47 0.36 

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13 
Explained Common Variance of the general factor =  0.65 

The degrees of freedom are 12  and the fit is  0.03 
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02 
The df corrected root mean square of the residuals is  0.03
RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32 
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08 
The df corrected root mean square of the residuals is  0.09 

RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97 

Measures of factor score adequacy 
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

 Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

 Omega Hierarchical from a confirmatory model using sem =  0.72
 Omega Total from a confirmatory model using sem =  0.83 
With loadings of 
      g  F1*  F2*  F3*   h2   u2
V1 0.74 0.40           0.71 0.29
V2 0.58 0.36           0.47 0.53
V3 0.56 0.42           0.50 0.50
V4 0.57      0.45      0.53 0.47
V5 0.58      0.25      0.40 0.60
V6 0.43      0.26      0.26 0.74
V7 0.49           0.38 0.38 0.62
V8 0.44           0.45 0.39 0.61
V9 0.34           0.24 0.17 0.83

With eigenvalues of:
   g  F1*  F2*  F3* 
2.60 0.47 0.40 0.33 


6.3.1 Other estimates of reliability

Other estimates of reliability are found by the splitHalf function. These are described in more detail in Revelle and Zinbarg (2009). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.

> splitHalf(r9)

Split half reliabilities 
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) =  0.84
Guttman lambda 6                          =  0.8
Average split half reliability            =  0.79
Guttman lambda 3 (alpha)                  =  0.8
Minimum split half reliability (beta)     =  0.72

6.4 Reliability and correlations of multiple scales within an inventory

A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness and Openness). The data may either be the raw data or a correlation matrix (scoreItems), or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6*.

6.4.1 Scoring from raw data

To score these five scales from the 25 items, use the scoreItems function with the helper function make.keys. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.
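For example, for a six point bfi item (minimum 1, maximum 6), a reversed response is simply 7 minus the response; a minimal sketch:

> reversed.A1 <- 7 - bfi$A1   #(max + min) - item = (6 + 1) - item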

There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.

The printed output includes coefficients α and G6*, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.

First, create the keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copying it into R.) Although not normally necessary, show the keys to understand what is happening.

Note that the number of items to specify in the make.keys function is the total number of items in the inventory. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.

> keys <- make.keys(nvars=28,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+    Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+    Openness = c(21,-22,23,24,-25)),
+    item.labels=colnames(bfi))
> keys

          Agree Conscientious Extraversion Neuroticism Openness
A1           -1             0            0           0        0
A2            1             0            0           0        0
A3            1             0            0           0        0
A4            1             0            0           0        0
A5            1             0            0           0        0
C1            0             1            0           0        0
C2            0             1            0           0        0
C3            0             1            0           0        0
C4            0            -1            0           0        0
C5            0            -1            0           0        0
E1            0             0           -1           0        0
E2            0             0           -1           0        0
E3            0             0            1           0        0
E4            0             0            1           0        0
E5            0             0            1           0        0
N1            0             0            0           1        0
N2            0             0            0           1        0
N3            0             0            0           1        0
N4            0             0            0           1        0
N5            0             0            0           1        0
O1            0             0            0           0        1
O2            0             0            0           0       -1
O3            0             0            0           0        1
O4            0             0            0           0        1
O5            0             0            0           0       -1
gender        0             0            0           0        0
education     0             0            0           0        0
age           0             0            0           0        0

The use of multiple key matrices for different inventories is facilitated by using the superMatrix function to combine two or more matrices. This allows convenient scoring of large data sets combining multiple inventories with keys based upon each individual inventory. Pretend for the moment that the big 5 items were made up of two inventories, one consisting of the first 10 items, the second the last 18 items (15 personality items + 3 demographic items). Then the following code would work:

> keys.1 <- make.keys(10,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10)))
> keys.2 <- make.keys(15,list(Extraversion=c(-1,-2,3:5),Neuroticism=c(6:10),
+    Openness = c(11,-12,13,14,-15)))
> keys.25 <- superMatrix(list(keys.1,keys.2))

The resulting keys matrix is identical to that found above, except that it does not include the extra 3 demographic items. This is useful when scoring the raw items, because the response frequencies for each category are reported, and for the demographic data.

This use of making multiple key matrices, and then combining them into one super matrix of keys, is particularly useful when combining demographic information with items to be scored. A set of demographic keys can be made and then these can be combined with the keys for the particular scales.

Now use these keys in combination with the raw data to score the items, calculate basic reliability and intercorrelations, and find the item-by-scale correlations for each item and each scale. By default, missing data are replaced by the median for that variable.

> scores <- scoreItems(keys,bfi)
> scores

Call: scoreItems(keys = keys, items = bfi)

(Unstandardized) Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
alpha   0.7          0.72         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
ASE   0.014         0.014        0.013       0.011    0.017

Average item correlation:
          Agree Conscientious Extraversion Neuroticism Openness
average.r  0.32          0.34         0.39        0.46     0.23

Guttman 6* reliability:
         Agree Conscientious Extraversion Neuroticism Openness
Lambda.6   0.7          0.72         0.76        0.81      0.6

Signal/Noise based upon av.r:
             Agree Conscientious Extraversion Neuroticism Openness
Signal/Noise   2.3           2.6          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation 
raw correlations below the diagonal, alpha on the diagonal 
corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.70          0.36         0.63      -0.245     0.23
Conscientious  0.26          0.72         0.35      -0.305     0.30
Extraversion   0.46          0.26         0.76      -0.284     0.32
Neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
Openness       0.15          0.19         0.22      -0.086     0.60

 In order to see the item by scale loadings and frequency counts of the data
 print with the short option = FALSE

To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores (see Figure 16).

6.4.2 Forming scales from a correlation matrix

There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.

Consider the same bfi data set, but first find the correlations, and then use cluster.cor:

> r.bfi <- cor(bfi,use="pairwise")
> scales <- cluster.cor(keys,r.bfi)
> summary(scales)

Call: cluster.cor(keys = keys, r.mat = r.bfi)

Scale intercorrelations corrected for attenuation 
raw correlations below the diagonal, (standardized) alpha on the diagonal 
corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.71          0.35         0.64      -0.242     0.25
Conscientious  0.25          0.73         0.36      -0.287     0.30
Extraversion   0.47          0.27         0.76      -0.278     0.35
Neuroticism   -0.18         -0.22        -0.22       0.815    -0.11
Openness       0.16          0.20         0.24      -0.074     0.61

To find the correlations of the items with each of the scales (the "structure" matrix), or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.


> png('scores.png')
> pairs.panels(scores$scores,pch='.',jiggle=TRUE)
> dev.off()
quartz 
     2 

Figure 16: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.


6.5 Scoring Multiple Choice Items

Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice, where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).

> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys,iqitems)

Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

item statistics 
         key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39

> #just convert the items to true or false 
> iq.tf <- score.multiple.choice(iq.keys,iqitems,score=FALSE)
> describe(iq.tf)  #compare to previous results

         vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01

Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
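A minimal sketch of that step (the single scale key here is hypothetical, made up only to illustrate the call):

> iq.keys.mat <- make.keys(16,list(iq=1:16),item.labels=colnames(iq.tf))   #hypothetical key: all 16 items on one scale
> iq.scores <- scoreItems(iq.keys.mat,iq.tf)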

6.6 Item analysis

Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data, or based upon correlation matrices using the set.cor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.

Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 17).

7 Item Response Theory analysis

The use of Item Response Theory is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty (or location) and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 18).


> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys,iqitems,score=TRUE,short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores.
> op <- par(mfrow=c(2,2))  #set this to see the output for multiple items
> irt.responses(scores$scores,iqitems[1:4],breaks=11)

[Figure 17 here: four panel irt.responses plot for reason4, reason16, reason17 and reason19, showing P(response) for each alternative (0 to 6) as a function of theta]

Figure 17: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.


7.1 Factor analysis and Item Response Theory

If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δj, and item discrimination, αj, may be found from factor analysis of the tetrachoric correlations, where λj is just the factor loading on the first factor and τj is the normal threshold reported by the tetrachoric function:

\[
\delta_j = \frac{D\tau_j}{\sqrt{1-\lambda_j^2}}, \qquad \alpha_j = \frac{\lambda_j}{\sqrt{1-\lambda_j^2}} \tag{2}
\]

where D is a scaling factor used when converting to the parameterization of the logistic model, and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λj) and item thresholds (τj) are just

\[
\lambda_j = \frac{\alpha_j}{\sqrt{1+\alpha_j^2}}, \qquad \tau_j = \frac{\delta_j}{\sqrt{1+\alpha_j^2}}
\]
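A minimal numeric sketch of these conversions in the normal metric (D = 1); the loading and threshold values here are made up for illustration:

> lambda <- 0.6                        #an illustrative tetrachoric factor loading
> tau <- 0.5                           #an illustrative normal threshold
> alpha.j <- lambda/sqrt(1-lambda^2)   #item discrimination
> delta.j <- tau/sqrt(1-lambda^2)      #item difficulty (normal ogive metric)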

Consider 9 dichotomous items representing one factor, but differing in their levels of difficulty:

> set.seed(17)
> d9 <- sim.irt(9,1000,-2.5,2.5,mod="normal")  #dichotomous items
> test <- irt.fa(d9$items)

> test

Item Response Analysis using Factor Analysis 

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis 

 Summary information by factor and item
 Factor =  1 
              -3   -2   -1    0    1    2     3
V1          0.48 0.59 0.24 0.06 0.01 0.00  0.00
V2          0.30 0.68 0.45 0.12 0.02 0.00  0.00
V3          0.12 0.50 0.74 0.29 0.06 0.01  0.00
V4          0.05 0.26 0.74 0.55 0.14 0.03  0.00
V5          0.01 0.07 0.44 1.05 0.41 0.06  0.01
V6          0.00 0.03 0.15 0.60 0.74 0.24  0.04
V7          0.00 0.01 0.04 0.22 0.73 0.63  0.16
V8          0.00 0.00 0.02 0.12 0.45 0.69  0.31
V9          0.00 0.01 0.02 0.08 0.25 0.47  0.36
Test Info   0.98 2.14 2.85 3.09 2.81 2.14  0.89
SEM         1.01 0.68 0.59 0.57 0.60 0.68  1.06
Reliability -0.02 0.53 0.65 0.68 0.64 0.53 -0.12

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
    fm = fm)

Test of the hypothesis that 1 factor is sufficient
The degrees of freedom for the model is 27  and the objective function was  1.2 
The number of observations was  1000  with Chi Square =  1195.58  with prob <  7.3e-235 

The root mean square of the residuals (RMSA) is  0.09 
The df corrected root mean square of the residuals is  0.1 

Tucker Lewis Index of factoring reliability =  0.699
RMSEA index =  0.209  and the 90 % confidence intervals are  0.198 0.218
BIC =  1009.07

Similar analyses can be done for polytomous items, such as those of the bfi extraversion scale:

> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt

Item Response Analysis using Factor Analysis 

Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis 

 Summary information by factor and item
 Factor =  1 
              -3   -2   -1    0    1    2    3
E1          0.37 0.58 0.75 0.76 0.61 0.39 0.21
E2          0.48 0.89 1.18 1.24 0.99 0.59 0.25
E3          0.34 0.49 0.58 0.57 0.49 0.36 0.23
E4          0.63 0.98 1.11 0.96 0.67 0.37 0.16
E5          0.33 0.42 0.47 0.45 0.37 0.27 0.17
Test Info   2.15 3.36 4.09 3.98 3.13 1.97 1.02
SEM         0.68 0.55 0.49 0.50 0.56 0.71 0.99
Reliability 0.53 0.70 0.76 0.75 0.68 0.49 0.02

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
    fm = fm)

Test of the hypothesis that 1 factor is sufficient
The degrees of freedom for the model is 5  and the objective function was  0.05 
The number of observations was  2800  with Chi Square =  135.92  with prob <  1.3e-27 

The root mean square of the residuals (RMSA) is  0.04 
The df corrected root mean square of the residuals is  0.06 

Tucker Lewis Index of factoring reliability =  0.932
RMSEA index =  0.097  and the 90 % confidence intervals are  0.083 0.111
BIC =  96.23

The item information functions show that not all of the items are equally good (Figure 19):

These procedures can be generalized to more than one factor by specifying the number of


> op <- par(mfrow=c(3,1))
> plot(test,type="ICC")
> plot(test,type="IIC")
> plot(test,type="test")
> op <- par(mfrow=c(1,1))

[Figure 18 here: three panel plot for the 9 simulated dichotomous items: "Item parameters from factor analysis" (Probability of Response against the Latent Trait, normal scale), "Item information from factor analysis", and "Test information - item parameters from factor analysis" with a reliability axis]

Figure 18: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.


> e.info <- plot(e.irt,type="IIC")

[Figure 19 here: "Item information from factor analysis for factor 1" curves for -E1, -E2, E3, E4 and E5 as a function of the Latent Trait (normal scale)]

Figure 19: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.


factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.

> print(e.info,sort=TRUE)

Item Response Analysis using Factor Analysis 

 Summary information by factor and item
 Factor =  1 
              -3   -2   -1    0    1    2    3
E2          0.48 0.89 1.18 1.24 0.99 0.59 0.25
E4          0.63 0.98 1.11 0.96 0.67 0.37 0.16
E1          0.37 0.58 0.75 0.76 0.61 0.39 0.21
E3          0.34 0.49 0.58 0.57 0.49 0.36 0.23
E5          0.33 0.42 0.47 0.45 0.37 0.27 0.17
Test Info   2.15 3.36 4.09 3.98 3.13 1.97 1.02
SEM         0.68 0.55 0.49 0.50 0.56 0.71 0.99
Reliability 0.53 0.70 0.76 0.75 0.68 0.49 0.02

More extensive IRT packages include ltm and eRm; they should be used for serious Item Response Theory analysis.

7.2 Speeding up analyses

Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.
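A minimal sketch of the pattern (the saved matrix is in the rho element, as also used with iq.irt$rho below):

> e.irt <- irt.fa(bfi[11:15])        #slow step: finds the polychoric correlations
> f1 <- fa(e.irt$rho,1,n.obs=2800)   #fast step: reuses the saved correlation matrix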

For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 20) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items (Figure 21).

> iq.irt

Item Response Analysis using Factor Analysis 

Call: irt.fa(x = iq.tf)
Item Response Analysis using Factor Analysis 

 Summary information by factor and item
 Factor =  1 
            -3   -2   -1    0    1    2    3
reason4   0.04 0.20 0.59 0.59 0.19 0.04 0.01
reason16  0.07 0.23 0.46 0.39 0.16 0.05 0.01
reason17  0.06 0.27 0.69 0.50 0.14 0.03 0.01
reason19  0.05 0.17 0.41 0.46 0.22 0.07 0.02


> iq.irt <- irt.fa(iq.tf)

[Figure 20 here: "Item information from factor analysis" curves for the 16 ability items (reason, letter, matrix and rotate items) as a function of the Latent Trait (normal scale)]

Figure 20: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.


letter7     0.04 0.16 0.44 0.52 0.23 0.06  0.02
letter33    0.04 0.14 0.34 0.43 0.24 0.08  0.02
letter34    0.04 0.17 0.48 0.54 0.22 0.06  0.01
letter58    0.02 0.08 0.28 0.55 0.39 0.13  0.03
matrix45    0.05 0.11 0.23 0.29 0.21 0.10  0.04
matrix46    0.05 0.12 0.25 0.30 0.21 0.10  0.04
matrix47    0.05 0.16 0.37 0.41 0.21 0.07  0.02
matrix55    0.04 0.08 0.14 0.20 0.19 0.13  0.06
rotate3     0.00 0.01 0.07 0.29 0.66 0.45  0.13
rotate4     0.00 0.01 0.05 0.30 0.91 0.55  0.11
rotate6     0.01 0.03 0.14 0.49 0.68 0.27  0.06
rotate8     0.00 0.02 0.08 0.27 0.54 0.39  0.13
Test Info   0.55 1.95 4.99 6.53 5.41 2.57  0.71
SEM         1.34 0.72 0.45 0.39 0.43 0.62  1.18
Reliability -0.80 0.49 0.80 0.85 0.82 0.61 -0.40

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
    fm = fm)

Test of the hypothesis that 1 factor is sufficient
The degrees of freedom for the model is 104  and the objective function was  1.85 
The number of observations was  1525  with Chi Square =  2801.05  with prob <  0 

The root mean square of the residuals (RMSA) is  0.08 
The df corrected root mean square of the residuals is  0.09 

Tucker Lewis Index of factoring reliability =  0.742
RMSEA index =  0.131  and the 90 % confidence intervals are  0.126 0.135
BIC =  2038.76

7.3 IRT based scoring

The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for score.irt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items/person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The score.irt function applies basic IRT to this problem.

Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects:

> v9 <- sim.irt(9,1000,-2,2,mod="normal")  #dichotomous items
> items <- v9$items
> test <- irt.fa(items)


> om <- omega(iq.irt$rho,4)

[Figure 21 here: "Omega" bifactor diagram for the 16 ability items, showing a general factor g (loadings about 0.4 to 0.7) and four group factors F1 to F4]

Figure 21: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when examining the item response functions.


> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> #now delete some of the data - note that they are ordered by score
> items[1:333,5:9] <- NA
> items[334:666,3:7] <- NA
> items[667:1000,1:4] <- NA
> scores <- score.irt(test,items)
> unitweighted <- score.irt(items=items,keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord],scores,unitweighted)
> colnames(scores.df) <- c("True theta","irt theta","total","fit","rasch","total","fit")

These results are seen in Figure 22.

8 Multilevel modeling

Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (over all descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.

8.1 Decomposing data into within and between level correlations using statsBy

There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.

This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:

r_{xy} = \eta_{x_{wg}} \ast \eta_{y_{wg}} \ast r_{xy_{wg}} + \eta_{x_{bg}} \ast \eta_{y_{bg}} \ast r_{xy_{bg}}    (3)


Figure 22: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported, and then the IRT and total scoring systems.

> pairs.panels(scores.df, pch='.', gap=0)
> title('Comparing true theta for IRT, Rasch and classically based scoring', line=3)


where $r_{xy}$ is the normal correlation which may be decomposed into a within group and a between group correlation, $r_{xy_{wg}}$ and $r_{xy_{bg}}$, and $\eta$ (eta) is the correlation of the data with the within group values or the group means.

8.2 Generating and displaying multilevel data

withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5 and V8, and V3, V6 and V9. V1 has a within group correlation of 0 with V2, V5 and V8, and a -1 within group correlation with V3, V6 and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5 and V6, and V7, V8 and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.

sim.multilevel will generate simulated data with a multilevel structure.
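As a quick sketch of how statsBy recovers the crossed structure just described (this assumes the grouping column in withinBetween is named "Group"; check colnames(withinBetween) if your copy differs):

> data(withinBetween)
> wb <- statsBy(withinBetween, group = "Group")   # "Group" is an assumed column name
> round(wb$rwg, 1)   # pooled within group correlations: the 1, 0, -1 pattern
> round(wb$rbg, 1)   # between group correlations: the other 1, 0, -1 pattern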

The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.

Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)) or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
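A minimal sketch of the first of these, pulling out the two correlation matrices that enter equation 3 (rwg and rbg are the statsBy output elements for the pooled within group and between group correlations; see ?statsBy for the full list of output):

> data(sat.act)
> ed <- statsBy(sat.act, group = "education")
> round(ed$rwg, 2)   # correlations pooled within the education levels
> round(ed$rbg, 2)   # correlations of the education group means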

9 Set Correlation and Multiple Regression from the correlation matrix

An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the set.cor function. Set correlation is

R^2 = 1 - \prod_{i=1}^{n} (1 - \lambda_i)

where $\lambda_i$ is the ith eigen value of the eigen value decomposition of the matrix

R = R_{xx}^{-1} R_{xy} R_{yy}^{-1} R_{yx}.

Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic, based upon the average canonical correlation, might be more appropriate.

set.cor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.

Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using set.cor. Two things to notice: set.cor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example, the analysis is done on the correlation matrix rather than the raw data.

> C <- cov(sat.act, use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)

Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared: 0.0272,  Adjusted R-squared: 0.02301
F-statistic: 6.487 on 3 and 696 DF,  p-value: 0.0002476

Compare this with the output from set.cor:

> # compare with mat.regress
> set.cor(c(4:6), c(1:3), C, n.obs=700)


Call: setCor(y = y, x = x, data = data, z = z, n.obs = n.obs, use = use,
    std = std, square = square, main = main, plot = plot)

Multiple Regression from matrix input

Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19

multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11

Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05


degrees of freedom of regression
[1]   3 696

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.050 0.033 0.008
Chisq of canonical correlations
[1] 35.8 23.1  5.6

Average squared canonical correlation = 0.03
Cohen's Set Correlation R2 = 0.09
Shrunken Set Correlation R2 = 0.08
F and df of Cohen's Set Correlation  7.26  9  1681.86
Unweighted correlation between the two sets = 0.01

Note that the set.cor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same independent of the direction of the relationship.
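The set correlation formula can also be checked directly against this output using base R alone (a sketch; R, Rxx, Ryy, Rxy, and lambda are illustrative names, and C is the covariance matrix computed above):

> R <- cov2cor(C)       # gender, education, age, ACT, SATV, SATQ
> Rxx <- R[1:3, 1:3]    # the predictor set
> Ryy <- R[4:6, 4:6]    # the criterion set
> Rxy <- R[1:3, 4:6]
> lambda <- eigen(solve(Rxx) %*% Rxy %*% solve(Ryy) %*% t(Rxy))$values
> round(lambda, 3)      # should reproduce the squared canonical correlations above
> 1 - prod(1 - lambda)  # Cohen's set correlation R2, about .09

Because the nonzero eigenvalues are unchanged when the roles of the two sets are exchanged, the symmetry of the set correlation is apparent from this computation as well.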

10 Simulation functions

It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.

A number of functions in the psych package will generate simulated data. These functions include sim, for a factor simplex, and sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated Rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, sim.anova for ANOVA and lm simulations, and sim.VSS. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.

sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.simple Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.

sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.

Some of these functions are described in more detail in the companion vignette, psych for sem.

The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.
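A minimal sketch of those defaults (this assumes that with n > 0 the simulated raw data are returned in the $observed element of the result; verify with str() before relying on it):

> set.seed(42)                     # illustrative seed, for reproducibility
> sim.data <- sim.structure(n = 500)
> f4 <- fa(sim.data$observed, nfactors = 4)   # try to recover the 4 factor simplex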

Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.

Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.
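For example, a brief sketch of the minor factor challenge (this assumes sim.minor accepts nvar, nfact, and n arguments and, like the other sim functions, returns the simulated sample in $observed; adjust to the actual help page if needed):

> set.seed(17)                       # illustrative seed
> tiny <- sim.minor(nvar = 12, nfact = 3, n = 500)
> fa.parallel(tiny$observed)         # do the minor factors inflate the suggested number of factors?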

Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.


The four irt simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.

The logistic model is

P(x | \theta_i, \delta_j, \gamma_j, \zeta_j) = \gamma_j + \frac{\zeta_j - \gamma_j}{1 + e^{\alpha_j (\delta_j - \theta_i)}}    (4)

where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), $\alpha_j$ is item discrimination, and $\delta_j$ is item difficulty. For the 1 Parameter Logistic (Rasch) model, gamma = 0, zeta = 1, alpha = 1, and item difficulty is the only free parameter to specify.
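Equation 4 is easy to compute directly. The following is a plain R transcription of it (irt.p is an illustrative name, not a psych function):

> irt.p <- function(theta, delta, alpha = 1, gamma = 0, zeta = 1) {
+    gamma + (zeta - gamma)/(1 + exp(alpha * (delta - theta)))
+ }
> irt.p(theta = 1, delta = 0)   # an able subject on a medium difficulty Rasch item
[1] 0.7310586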

(Graphics of these may be seen in the demonstrations for the logistic function.)

The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same. With the a = α parameter = 1.702 in the logistic model, the two models are practically identical.
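That near-identity is easy to verify numerically with base R (plogis is the logistic distribution function):

> theta <- seq(-3, 3, .01)
> max(abs(plogis(1.702 * theta) - pnorm(theta)))   # less than .01 everywhere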

11 Graphical Displays

Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as by specific diagram functions, e.g., (but not shown):

f3 <- fa(Thurstone,3)
plot(f3)
fa.diagram(f3)
c <- iclust(Thurstone)
plot(c)                  # a pretty boring plot
iclust.diagram(c)        # a better diagram
c3 <- iclust(Thurstone,3)
plot(c3)                 # a more interesting plot
data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)
ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)

The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in the dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.


> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA, xlim=xlim, ylim=ylim, main="Demonstration of dia functions", axes=FALSE, xlab="", ylab="")
> ul <- dia.rect(1, 9, labels="upper left", xlim=xlim, ylim=ylim)
> ll <- dia.rect(1, 3, labels="lower left", xlim=xlim, ylim=ylim)
> lr <- dia.ellipse(9, 3, "lower right", xlim=xlim, ylim=ylim)
> ur <- dia.ellipse(9, 9, "upper right", xlim=xlim, ylim=ylim)
> ml <- dia.ellipse(3, 6, "middle left", xlim=xlim, ylim=ylim)
> mr <- dia.ellipse(7, 6, "middle right", xlim=xlim, ylim=ylim)
> bl <- dia.ellipse(1, 1, "bottom left", xlim=xlim, ylim=ylim)
> br <- dia.rect(9, 1, "bottom right", xlim=xlim, ylim=ylim)
> dia.arrow(from=lr, to=ul, labels="right to left")
> dia.arrow(from=ul, to=ur, labels="left to right")
> dia.curved.arrow(from=lr, to=ll$right, labels="right to left")
> dia.curved.arrow(to=ur, from=ul$right, labels="left to right")
> dia.curve(ll$top, ul$bottom, "double")   # for rectangles, specify where to point
> dia.curved.arrow(mr, ur, "up")           # but for ellipses, just point to it
> dia.curve(ml, mr, "across")
> dia.arrow(ur, lr, "top down")
> dia.curved.arrow(br$top, lr$bottom, "up")
> dia.curved.arrow(bl, br, "left to right")
> dia.arrow(bl, ll$bottom)
> dia.curved.arrow(ml, ll$right)
> dia.curved.arrow(mr, lr$top)


Figure 23: The basic graphic capabilities of the dia functions are shown in this figure.

12 Miscellaneous functions

A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.

block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score (see the short sketch following this list).

geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headtail. Combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r, and estimates effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also set.cor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.


superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys or when forming example problems.
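A few of these helpers in action (fisherz, geometric.mean, and harmonic.mean are all in psych; the values shown follow directly from their definitions):

> fisherz(.5)                  # Fisher z transformation of r = .5
[1] 0.5493061
> geometric.mean(c(1, 2, 4))   # appropriate for ratio or growth data
[1] 2
> harmonic.mean(c(1, 2, 4))    # appropriate for averaging rates
[1] 1.714286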

13 Data sets

A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941) and were reanalyzed by Bechtoldt (1961). Personality item data representing five personality factors on 25 items (bfi) or 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems) are also included. The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.

Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative, and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.


iq 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton height. peas is the data set Francis Galton used to introduce the correlation coefficient with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1994), this data set allows for examples of basic scaling techniques.

14 Development version and a users guide

The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered ones). To download that version, go to the repository http://personality-project.org/r/src/contrib and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.

News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news,

> news(Version > "1.2.8", package = "psych")


15 Psychometric Theory

The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book): An introduction to Psychometric Theory with Applications in R (Revelle, prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular and R in general, consult http://personality-project.org/r/r.guide.html, A short guide to R.

16 SessionInfo

This document was prepared using the following settings:

> sessionInfo()

R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] psych_1.6.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4           lattice_0.20-33       matrixcalc_1.0-3      MASS_7.3-45           grid_3.3.0            arm_1.8-6
 [7] nlme_3.1-127          stats4_3.3.0          coda_0.18-1           minqa_1.2.4           mi_1.0                nloptr_1.0.4
[13] Matrix_1.2-6          boot_1.3-18           splines_3.3.0         lme4_1.1-12           sem_3.1-7             tools_3.3.0
[19] abind_1.4-3           GPArotation_2014.11-1 parallel_3.3.0        mnormt_1.5-4


References

Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405-432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439-458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447-473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel Modeling in R (2.3): A Brief Introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245-276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd ed. edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78-98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297-334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173-178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England. 122 pp.

Fox, J., Nie, Z., and Byrnes, J. (2013). sem: Structural Equation Models. R package version 3.1-3.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430-450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255-282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121-132.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41-54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179-185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283-300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4):537-549.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231-258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309-317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153-175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676-1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481-495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill, New York, 3rd edition.

Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57-74.

Revelle, W. (2017). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. http://cran.r-project.org/web/packages/psych. R package version 1.7.3.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39-73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403-414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Individual differences in cognition: New methods for examining the personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27-49. Springer, New York, NY.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega and the glb: comments on Sijtsma. Psychometrika, 74(1):145-154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83-90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420-428.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. A Series of books in biology. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. A Series of books in biology. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72-101.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345-353.

Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components - an alternative to "mathematical factors." Psychological Review, 42(5):425-454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321-327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123-133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121-144.




1 Overview of this and related documents

To do basic and advanced personality and psychological research using R is not as compli-cated as some think This is one of a set of ldquoHow Tordquo to do various things using R (R CoreTeam 2017) particularly using the psych (Revelle 2017) package

The current list of How Torsquos includes

1 Installing R and some useful packages

2 Using R and the psych package to find omegah and ωt

3 Using R and the psych for factor analysis and principal components analysis (Thisdocument)

4 Using the scoreitems function to find scale scores and scale statistics

5 An overview (vignette) of the psych package

11 Jump starting the psych packagendasha guide for the impatient

You have installed psych (section 3) and you want to use it without reading much moreWhat should you do

1 Activate the psych packageR code

library(psych)

2 Input your data (section 42) If your file name ends in sav text txt csv xptrds Rds rda or RDATA then just read it in directly using readfile Or youcan go to your friendly text editor or data manipulation program (eg Excel) andcopy the data to the clipboard Include a first line that has the variable labels Pasteit into psych using the readclipboardtab command

R codemyData lt- readfile() this will open a search window on your machine and read or load the fileorfirst copy your file to your clipboard and thenmyData lt- readclipboardtab() if you have an excel file

3 Make sure that what you just read is right Describe it (section 43) and perhapslook at the first and last few lines If you want to ldquoviewrdquo the first and last few linesusing a spreadsheet like viewer use quickView

R codedescribe(myData)headTail(myData)

3

orquickView(myData)

4 Look at the patterns in the data If you have fewer than about 10 variables look atthe SPLOM (Scatter Plot Matrix) of the data using pairspanels (section 441)

R codepairspanels(myData)

5 Find the correlations of all of your data

bull Descriptively (just the values) (section 442)R code

lowerCor(myData)

bull Graphically (section 443)R code

corPlot(r)

6 Test for the number of factors in your data using parallel analysis (faparallelsection 542) or Very Simple Structure (vss 541)

R codefaparallel(myData)vss(myData)

7 Factor analyze (see section 51) the data with a specified number of factors (thedefault is 1) the default method is minimum residual the default rotation for morethan one factor is oblimin There are many more possibilities (see sections 511-513)Compare the solution to a hierarchical cluster analysis using the ICLUST algorithm(Revelle 1979) (see section 516) Also consider a hierarchical factor solution to findcoefficient ω (see 515)

R codefa(myData)iclust(myData)omega(myData)

8 Some people like to find coefficient α as an estimate of reliability This may be donefor a single scale using the alpha function (see 61) Perhaps more useful is theability to create several scales as unweighted averages of specified items using thescoreIems function (see 64) and to find various estimates of internal consistency forthese scales find their intercorrelations and find scores for all the subjects

R codealpha(myData) score all of the items as part of one scalemyKeys lt- makekeys(nvar=20list(first = c(1-35-7810)second=c(24-61115-16)))myscores lt- scoreItems(myKeysmyData) form several scalesmyscores show the highlights of the results

4

At this point you have had a chance to see the highlights of the psych package and todo some basic (and advanced) data analysis You might find reading the entire overviewvignette helpful to get a broader understanding of what can be done in R using the psychRemember that the help command () is available for every function Try running theexamples for each help page

5

2 Overview of this and related documents

The psych package (Revelle 2017) has been developed at Northwestern University since2005 to include functions most useful for personality psychometric and psychological re-search The package is also meant to supplement a text on psychometric theory (Revelleprep) a draft of which is available at httppersonality-projectorgrbook

Some of the functions (eg readfile readclipboard describe pairspanels scat-terhist errorbars multihist bibars) are useful for basic data entry and descrip-tive analyses

Psychometric applications emphasize techniques for dimension reduction including factoranalysis cluster analysis and principal components analysis The fa function includesfive methods of factor analysis (minimum residual principal axis weighted least squaresgeneralized least squares and maximum likelihood factor analysis) Determining the num-ber of factors or components to extract may be done by using the Very Simple Structure(Revelle and Rocklin 1979) (vss) Minimum Average Partial correlation (Velicer 1976)(MAP) or parallel analysis (faparallel) criteria Item Response Theory (IRT) models fordichotomous or polytomous items may be found by factoring tetrachoric or polychoriccorrelation matrices and expressing the resulting parameters in terms of location and dis-crimination using irtfa Bifactor and hierarchical factor structures may be estimated byusing Schmid Leiman transformations (Schmid and Leiman 1957) (schmid) to transforma hierarchical factor structure into a bifactor solution (Holzinger and Swineford 1937)Scale construction can be done using the Item Cluster Analysis (Revelle 1979) (iclust)function to determine the structure and to calculate reliability coefficients α (Cronbach1951)(alpha scoreItems scoremultiplechoice) β (Revelle 1979 Revelle and Zin-barg 2009) (iclust) and McDonaldrsquos ωh and ωt (McDonald 1999) (omega) Guttmanrsquos sixestimates of internal consistency reliability (Guttman (1945) as well as additional estimates(Revelle and Zinbarg 2009) are in the guttman function The six measures of Intraclasscorrelation coefficients (ICC) discussed by Shrout and Fleiss (1979) are also available

Graphical displays include Scatter Plot Matrix (SPLOM) plots using pairspanels corre-lation ldquoheat mapsrdquo (corplot) factor cluster and structural diagrams using fadiagramiclustdiagram structurediagram as well as item response characteristics and itemand test information characteristic curves plotirt and plotpoly

3 Getting started

Some of the functions described in this overview require other packages Particularly usefulfor rotating the results of factor analyses (from eg fa or principal) or hierarchicalfactor models using omega or schmid is the GPArotation package These and other useful

6

packages may be installed by first installing and then using the task views (ctv) packageto install the ldquoPsychometricsrdquo task view but doing it this way is not necessary

4 Basic data analysis

A number of psych functions facilitate the entry of data and finding basic descriptivestatistics

Remember to run any of the psych functions it is necessary to make the package activeby using the library command

R codelibrary(psych)

The other packages once installed will be called automatically by psych

It is possible to automatically load psych and other functions by creating and then savinga ldquoFirstrdquo function eg

R codeFirst lt- function(x) library(psych)

41 Getting the data by using readfile

Although many find copying the data to the clipboard and then using the readclipboardfunctions (see below) a helpful alternative is to read the data in directly This can be doneusing the readfile function which calls filechoose to find the file and then based uponthe suffix of the file chooses the appropriate way to read it For files with suffixes of txttext r rds rda csv xpt or sav the file will be read correctly

R codemydata lt- readfile()

If the file contains Fixed Width Format (fwf) data the column information can be specifiedwith the widths command

R codemydata lt- readfile(widths = c(4rep(135)) will read in a file without a header row and 36 fields the first of which is 4 colums the rest of which are 1 column each

If the file is a RData file (with suffix of RData Rda rda Rdata or rdata) the objectwill be loaded Depending what was stored this might be several objects If the file is asav file from SPSS it will be read with the most useful default options (converting the fileto a dataframe and converting character fields to numeric) Alternative options may bespecified If it is an export file from SAS (xpt or XPT) it will be read csv files (commaseparated files) normal txt or text files data or dat files will be read as well These are

7

assumed to have a header row of variable labels (header=TRUE) If the data do not havea header row you must specify readfile(header=FALSE)

To read SPSS files and to keep the value labels specify usevaluelabels=TRUE Unfortu-nately this will make it difficult to do any analyses and you might want to read it againconverting the value labels to numeric values

R codemyspss lt- readfile(usevaluelabels=TRUE) this will keep the value labels for sav filesmydata lt- readfile() just read the file and convert the value labels to numeric

You can quickView the myspss object to see what it looks like as well as the mydata tosee the numeric encoding

R codequickViewmyspss to see the original encodingquickkView(mydata) to see the numeric encoding

42 Data input from the clipboard

There are of course many ways to enter data into R Reading from a local file usingreadfile is perhaps the most preferred However many users will enter their datain a text editor or spreadsheet program and then want to copy and paste into R Thismay be done by using readtable and specifying the input file as ldquoclipboardrdquo (PCs) orldquopipe(pbpaste)rdquo (Macs) Alternatively the readclipboard set of functions are perhapsmore user friendly

readclipboard is the base function for reading data from the clipboard

readclipboardcsv for reading text that is comma delimited

readclipboardtab for reading text that is tab delimited (eg copied directly from anExcel file)

readclipboardlower for reading input of a lower triangular matrix with or without adiagonal The resulting object is a square matrix

readclipboardupper for reading input of an upper triangular matrix

readclipboardfwf for reading in fixed width fields (some very old data sets)

For example given a data set copied to the clipboard from a spreadsheet just enter thecommand

R codemydata lt- readclipboard()

This will work if every data field has a value and even missing data are given some values(eg NA or -999) If the data were entered in a spreadsheet and the missing values

8

were just empty cells then the data should be read in as a tab delimited or by using thereadclipboardtab function

R codemydata lt- readclipboard(sep=t) define the tab option ormytabdata lt- readclipboardtab() just use the alternative function

For the case of data in fixed width fields (some old data sets tend to have this format)copy to the clipboard and then specify the width of each field (in the example below thefirst variable is 5 columns the second is 2 columns the next 5 are 1 column the last 4 are3 columns)

R codemydata lt- readclipboardfwf(widths=c(52rep(15)rep(34))

43 Basic descriptive statistics

Once the data are read in then describe will provide basic descriptive statistics arrangedin a data frame format Consider the data set satact which includes data from 700 webbased participants on 3 demographic variables and 3 ability measures

describe reports means standard deviations medians min max range skew kurtosisand standard errors for integer or real data Non-numeric data although the statisticsare meaningless will be treated as if numeric (based upon the categorical coding ofthe data) and will be flagged with an

It is very important to describe your data before you continue on doing more complicatedmultivariate statistics The problem of outliers and bad data can not be overempha-sized

gt library(psych)gt data(satact)gt describe(satact) basic descriptive statistics

vars n mean sd median trimmed mad min max range skew kurtosis segender 1 700 165 048 2 168 000 1 2 1 -061 -162 002education 2 700 316 143 3 331 148 0 5 5 -068 -007 005age 3 700 2559 950 22 2386 593 13 65 52 164 242 036ACT 4 700 2855 482 29 2884 445 3 36 33 -066 053 018SATV 5 700 61223 11290 620 61945 11861 200 800 600 -064 033 427SATQ 6 687 61022 11564 620 61725 11861 200 800 600 -059 -002 441

44 Simple descriptive graphics

Graphic descriptions of data are very helpful both for understanding the data as well ascommunicating important results Scatter Plot Matrices (SPLOMS) using the pairspanels

9

function are useful ways to look for strange effects involving outliers and non-linearitieserrorbarsby will show group means with 95 confidence boundaries

441 Scatter Plot Matrices

Scatter Plot Matrices (SPLOMS) are very useful for describing the data The pairspanelsfunction adapted from the help menu for the pairs function produces xy scatter plots ofeach pair of variables below the diagonal shows the histogram of each variable on thediagonal and shows the lowess locally fit regression line as well An ellipse around themean with the axis length reflecting one standard deviation of the x and y variables is alsodrawn The x axis in each scatter plot represents the column variable the y axis the rowvariable (Figure 1) When plotting many subjects it is both faster and cleaner to set theplot character (pch) to be rsquorsquo (See Figure 1 for an example)

pairspanels will show the pairwise scatter plots of all the variables as well as his-tograms locally smoothed regressions and the Pearson correlation When plottingmany data points (as in the case of the satact data it is possible to specify that theplot character is a period to get a somewhat cleaner graphic

442 Correlational structure

There are many ways to display correlations Tabular displays are probably the mostcommon The output from the cor function in core R is a rectangular matrix lowerMat

will round this to (2) digits and then display as a lower off diagonal matrix lowerCor

calls cor with use=lsquopairwisersquo method=lsquopearsonrsquo as default values and returns (invisibly)the full correlation matrix and displays the lower off diagonal matrix

gt lowerCor(satact)

gendr edctn age ACT SATV SATQgender 100education 009 100age -002 055 100ACT -004 015 011 100SATV -002 005 -004 056 100SATQ -017 003 -003 059 064 100

When comparing results from two different groups it is convenient to display them as onematrix with the results from one group below the diagonal and the other group above thediagonal Use lowerUpper to do this

gt female lt- subset(satactsatact$gender==2)gt male lt- subset(satactsatact$gender==1)gt lower lt- lowerCor(male[-1])

10

gt png( pairspanelspng )gt pairspanels(satactpch=)gt devoff()null device

1

Figure 1 Using the pairspanels function to graphically show relationships The x axisin each scatter plot represents the column variable the y axis the row variable Note theextreme outlier for the ACT The plot character was set to a period (pch=rsquorsquo) in order tomake a cleaner graph

11

edctn age ACT SATV SATQeducation 100age 061 100ACT 016 015 100SATV 002 -006 061 100SATQ 008 004 060 068 100

gt upper lt- lowerCor(female[-1])

edctn age ACT SATV SATQeducation 100age 052 100ACT 016 008 100SATV 007 -003 053 100SATQ 003 -009 058 063 100

gt both lt- lowerUpper(lowerupper)gt round(both2)

education age ACT SATV SATQeducation NA 052 016 007 003age 061 NA 008 -003 -009ACT 016 015 NA 053 058SATV 002 -006 061 NA 063SATQ 008 004 060 068 NA

It is also possible to compare two matrices by taking their differences and displaying one (be-low the diagonal) and the difference of the second from the first above the diagonal

gt diffs lt- lowerUpper(lowerupperdiff=TRUE)gt round(diffs2)

education age ACT SATV SATQeducation NA 009 000 -005 005age 061 NA 007 -003 013ACT 016 015 NA 008 002SATV 002 -006 061 NA 005SATQ 008 004 060 068 NA

443 Heatmap displays of correlational structure

Perhaps a better way to see the structure in a correlation matrix is to display a heat mapof the correlations This is just a matrix color coded to represent the magnitude of thecorrelation This is useful when considering the number of factors in a data set Considerthe Thurstone data set which has a clear 3 factor solution (Figure 2) or a simulated dataset of 24 variables with a circumplex structure (Figure 3) The color coding represents aldquoheat maprdquo of the correlations with darker shades of red representing stronger negativeand darker shades of blue stronger positive correlations As an option the value of thecorrelation can be shown

12

gt png(corplotpng)gt corplot(Thurstonenumbers=TRUEmain=9 cognitive variables from Thurstone)gt devoff()null device

1

Figure 2 The structure of correlation matrix can be seen more clearly if the variables aregrouped by factor and then the correlations are shown by color By using the rsquonumbersrsquooption the values are displayed as well

13

gt png(circplotpng)gt circ lt- simcirc(24)gt rcirc lt- cor(circ)gt corplot(rcircmain=24 variables in a circumplex)gt devoff()null device

1

Figure 3 Using the corplot function to show the correlations in a circumplex Correlationsare highest near the diagonal diminish to zero further from the diagonal and the increaseagain towards the corners of the matrix Circumplex structures are common in the studyof affect

14

45 Polychoric tetrachoric polyserial and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient If thedata eg ability items are thought to represent an underlying continuous although latentvariable the φ will underestimate the value of the Pearson applied to these latent variablesOne solution to this problem is to use the tetrachoric correlation which is based uponthe assumption of a bivariate normal distribution that has been cut at certain points Thedrawtetra function demonstrates the process A simple generalization of this to the caseof the multiple cuts is the polychoric correlation

Other estimated correlations based upon the assumption of bivariate normality with cutpoints include the biserial and polyserial correlation

If the data are a mix of continuous polytomous and dichotomous variables the mixedcor

function will calculate the appropriate mixture of Pearson polychoric tetrachoric biserialand polyserial correlations

The correlation matrix resulting from a number of tetrachoric or polychoric correlationmatrix sometimes will not be positive semi-definite This will also happen if the correlationmatrix is formed by using pair-wise deletion of cases The corsmooth function will adjustthe smallest eigen values of the correlation matrix to make them positive rescale all ofthem to sum to the number of variables and produce a ldquosmoothedrdquo correlation matrix Anexample of this problem is a data set of burt which probably had a typo in the originalcorrelation matrix Smoothing the matrix corrects this problem

5 Item and scale analysis

The main functions in the psych package are for analyzing the structure of items and ofscales and for finding various estimates of scale reliability These may be considered asproblems of dimension reduction (eg factor analysis cluster analysis principal compo-nents analysis) and of forming and estimating the reliability of the resulting compositescales

51 Dimension reduction through factor analysis and cluster analysis

Parsimony of description has been a goal of science since at least the famous dictumcommonly attributed to William of Ockham to not multiply entities beyond necessity1 Thegoal for parsimony is seen in psychometrics as an attempt either to describe (components)

1Although probably neither original with Ockham nor directly stated by him (Thorburn 1918) Ock-hamrsquos razor remains a fundamental principal of science

15

or to explain (factors) the relationships between many observed variables in terms of amore limited set of components or latent factors

The typical data matrix represents multiple items or scales usually thought to reflect fewerunderlying constructs2 At the most simple a set of items can be be thought to representa random sample from one underlying domain or perhaps a small set of domains Thequestion for the psychometrician is how many domains are represented and how well doeseach item represent the domains Solutions to this problem are examples of factor analysis(FA) principal components analysis (PCA) and cluster analysis (CA) All of these pro-cedures aim to reduce the complexity of the observed data In the case of FA the goal isto identify fewer underlying constructs to explain the observed data In the case of PCAthe goal can be mere data reduction but the interpretation of components is frequentlydone in terms similar to those used when describing the latent variables estimated by FACluster analytic techniques although usually used to partition the subject space ratherthan the variable space can also be used to group variables to reduce the complexity ofthe data by forming fewer and more homogeneous sets of tests or items

At the data level the data reduction problem may be solved as a Singular Value Decom-position of the original matrix although the more typical solution is to find either theprincipal components or factors of the covariance or correlation matrices Given the pat-tern of regression weights from the variables to the components or from the factors to thevariables it is then possible to find (for components) individual component or cluster scoresor estimate (for factors) factor scores

Several of the functions in psych address the problem of data reduction

fa incorporates five alternative algorithms minres factor analysis principal axis factoranalysis weighted least squares factor analysis generalized least squares factor anal-ysis and maximum likelihood factor analysis That is it includes the functionality ofthree other functions that will be eventually phased out

principal Principal Components Analysis reports the largest n eigen vectors rescaled bythe square root of their eigen values

factorcongruence The congruence between two factors is the cosine of the angle betweenthem This is just the cross products of the loadings divided by the sum of the squaredloadings This differs from the correlation coefficient in that the mean loading is notsubtracted before taking the products factorcongruence will find the cosinesbetween two (or more) sets of factor loadings

vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ2 estimate.

² Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.

fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.

fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

nfactors A number of different tests for the number of factors problem are run.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm, specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.
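As a quick sketch of how these functions fit together (using the built-in Thurstone correlation matrix; the same analyses are developed in detail in the sections that follow):

> f3 <- fa(Thurstone, 3, n.obs = 213)   # minres factor analysis of 9 ability variables
> p3 <- principal(Thurstone, 3)         # the corresponding principal components
> factor.congruence(f3, p3)             # cosines between the factor and component loadings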

5.1.1 Minimum Residual Factor Analysis

The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix ${}_nR_n$ be approximated by the product of a factor matrix ${}_nF_k$ and its transpose plus a diagonal matrix of uniquenesses?

$$R = FF' + U^2 \tag{1}$$

The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls" and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.

factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.

A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 1). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ" and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation, although prior versions defaulted to varimax. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).

5.1.2 Principal Axis Factor Analysis

An alternative, least squares algorithm, factor.pa (incorporated into fa as an option, fm="pa"), does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter parameter to 1 produces the SAS solution.
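For example, a minimal sketch of that one-iteration solution (max.iter is the fa argument controlling the number of principal axis iterations):

> f3.pa1 <- fa(Thurstone, 3, n.obs = 213, fm = "pa", max.iter = 1)   # one iteration, as in the SAS example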

The solutions from the fa, the factor.minres and factor.pa, as well as the principal functions can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations are varimax and quartimax. Oblique transformations include oblimin, quartimin, and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.

Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 2).

5.1.3 Weighted Least Squares Factor Analysis

Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller


Table 1: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.

> f3t <- fa(Thurstone,3,n.obs=213)
> f3t
Factor Analysis using method = minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  MR1   MR2   MR3   h2   u2 com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.0
First.Letters    0.00  0.86  0.00 0.73 0.27 1.0
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

 With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63


Table 2: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.

> f3 <- fa(Thurstone,3,n.obs = 213,fm="pa")
> f3o <- target.rot(f3)
> f3o
Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix
                  PA1   PA2   PA3   h2   u2
Sentences        0.89 -0.03  0.07 0.81 0.19
Vocabulary       0.89  0.07  0.00 0.80 0.20
Sent.Completion  0.83  0.04  0.03 0.70 0.30
First.Letters   -0.02  0.85 -0.01 0.73 0.27
4.Letter.Words  -0.05  0.74  0.09 0.57 0.43
Suffixes         0.17  0.63 -0.09 0.43 0.57
Letter.Series   -0.06 -0.08  0.84 0.69 0.31
Pedigrees        0.33 -0.09  0.48 0.37 0.63
Letter.Group    -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00


overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 3).

The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 4). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 4) or by a structure diagram (Figure 5).

> plot(f3t)

[SPLOM of the loadings on the three factors MR1, MR2, and MR3, each axis running from 0.0 to 0.8; panel title: "Factor Analysis"]

Figure 4: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.

A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but


Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.

> f3w <- fa(Thurstone,3,n.obs = 213,fm="wls")
> print(f3w,cut=0,digits=3)
Factor Analysis using method = wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix
                  WLS1   WLS2   WLS3    h2    u2  com
Sentences        0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary       0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion  0.833  0.034  0.007 0.735 0.265 1.00
First.Letters   -0.002  0.855  0.003 0.731 0.269 1.00
4.Letter.Words  -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes         0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series    0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees        0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group    -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

 With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627


> fa.diagram(f3t)

[Path diagram, titled "Factor Analysis": the nine Thurstone variables with loadings on MR1, MR2, and MR3 (ranging from about 0.5 to 0.9), and correlations of about 0.5 to 0.6 among the factors]

Figure 5: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.


that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.
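As a sketch, one way to obtain a Kaiser normalized rotation is through the kaiser function, which takes an unrotated solution and normalizes before rotating (the default rotation it then applies is oblimin; treat the exact arguments as an assumption):

> f3.none <- fa(Thurstone, 3, n.obs = 213, rotate = "none")
> f3.kaiser <- kaiser(f3.none)   # oblimin rotation with Kaiser normalization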

5.1.4 Principal Components analysis (PCA)

An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Psychologists typically use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 4 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. (Factor loadings and PCA loadings converge as the number of items increases; factor loadings may be thought of as the asymptotic principal component loadings as the number of items on the component grows towards infinity.) Also note how the goodness of fit statistics, based upon the residual off diagonal elements, is much worse than the fa solution.

5.1.5 Hierarchical and bi-factor solutions

For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).

Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 5) can themselves be factored. This results in a higher order factor model (Figure 6). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 7). Yet


Table 4: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 2. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.

> p3p <- principal(Thurstone,3,n.obs = 213,rotate="Promax")
> p3p
Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  RC1   RC2   RC3   h2   u2 com
Sentences        0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary       0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion  0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters    0.01  0.84  0.07 0.78 0.22 1.0
4.Letter.Words  -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes         0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series    0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees        0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group    -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

 With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
 with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98


another solution is to use structural equation modeling to directly solve for the general and group factors.

> om.h <- omega(Thurstone,n.obs=213,sl=FALSE)
> op <- par(mfrow=c(1,1))

[Path diagram, titled "Omega": the nine Thurstone variables load on F1, F2, and F3 (loadings from about 0.4 to 0.9), which in turn load on the higher order factor g (0.8, 0.8, and 0.7)]

Figure 6: A higher order factor solution to the Thurstone 9 variable problem.

Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011). Unfortunately, this rotation will frequently achieve a local rather than global optimum, and it is recommended to try multiple original configurations.
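A sketch of requesting that rotation (the rotate argument of fa is passed through to GPArotation; given the local optimum problem just noted, rerunning and comparing several solutions is prudent):

> f3.bi <- fa(Thurstone, 3, n.obs = 213, rotate = "bifactor")   # orthogonal bifactor rotation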


> om <- omega(Thurstone,n.obs=213)

[Path diagram, titled "Omega": each of the nine Thurstone variables loads on the general factor g (loadings from 0.5 to 0.7) as well as on one of the group factors F1, F2, or F3]

Figure 7: A bifactor solution to the Thurstone 9 variable problem.


5.1.6 Item Cluster Analysis: iclust

An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) reviews why the iclust algorithm is particularly appropriate for scale construction in marketing.

Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):

1. Find the proximity (e.g., correlation) matrix.

2. Identify the most similar pair of items.

3. Combine this most similar pair of items to form a new variable (cluster).

4. Find the similarity of this cluster to all other items and clusters.

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains, or in iclust if there is a failure to increase reliability coefficients α or β).

6. Purify the solution by reassigning items to the most similar cluster center.

iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 8).
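For example, a minimal sketch of fixing the number of clusters in advance (nclusters is the iclust argument for an a priori number of clusters):

> ic5 <- iclust(bfi[1:25], nclusters = 5)   # force a five cluster solution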


Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 7). The pattern matrix is available as an object in the results.

> data(bfi)
> ic <- iclust(bfi[1:25])

[ICLUST dendrogram: the 25 items form clusters C20 (α = 0.81, β = 0.63), C16 (α = 0.81, β = 0.76), C15 (α = 0.73, β = 0.67), and C21 (α = 0.41, β = 0.27), built up from smaller subclusters]

Figure 8: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.

The previous analysis (Figure 8) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 9). Note that finding the polychoric correlations takes some time the first time, but the next three analyses were done using that correlation matrix (r.poly$rho). When using the console for input, polychoric will report on its progress while working using


Table 5: The summary statistics from an iclust analysis shows three large clusters and a smaller cluster.

> summary(ic)  # show the results
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20    C16   C15   C21
C20  0.80 -0.291  0.40 -0.33
C16 -0.24  0.815 -0.29  0.11
C15  0.30 -0.221  0.73 -0.30
C21 -0.23  0.074 -0.20  0.61


progressBar. polychoric and tetrachoric, when running on OSX, will take advantage of the multiple cores on the machine and run somewhat faster.

Table 6: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.

> data(bfi)
> r.poly <- polychoric(bfi[1:25])  # the ... indicate the progress of the function

A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.

5.2 Confidence intervals using bootstrapping techniques

Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ2 and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 8). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.

5.3 Comparing factor/component/cluster solutions

Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of the congruence coefficient, which is just the cosine of the angle between the dimensions:

$$c_{f_i f_j} = \frac{\sum_{k=1}^{n} f_{ik} f_{jk}}{\sqrt{\sum f_{ik}^2 \sum f_{jk}^2}}$$

Consider the case of a four factor solution and a four cluster solution to the Big Five problem.

> f4 <- fa(bfi[1:25],4,fm="pa")
> factor.congruence(f4,ic)
      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12


> ic.poly <- iclust(r.poly$rho,title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)

[ICLUST dendrogram based on the polychoric correlations: all 25 items join a single hierarchy topped by C23 (α = 0.83, β = 0.58)]

Figure 9: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 8), which was done using Pearson correlations.


> ic.poly <- iclust(r.poly$rho,5,title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)

[ICLUST dendrogram with the solution constrained to five clusters: C20, C19, C16, C15, and C7, with O4 remaining unclustered]

Figure 10: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 9), which was done without specifying the number of clusters, and to the next one (Figure 11), which was done by changing the beta criterion.


> ic.poly <- iclust(r.poly$rho,beta.size=3,title="ICLUST beta.size=3")

[ICLUST dendrogram with beta.size = 3: essentially the same hierarchy as Figure 9, topped by C23 (α = 0.83, β = 0.58)]

Figure 11: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 8, 9, and 10).


Table 7: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.

> print(ic,cut=.3)
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
    O   P    C20   C16   C15   C21
A1 C20 C20
A2 C20 C20  0.59
A3 C20 C20  0.65
A4 C20 C20  0.43
A5 C20 C20  0.65
C1 C15 C15              0.54
C2 C15 C15              0.62
C3 C15 C15              0.54
C4 C15 C15        0.31 -0.66
C5 C15 C15 -0.30  0.36 -0.59
E1 C20 C20 -0.50
E2 C20 C20 -0.61  0.34
E3 C20 C20  0.59             -0.39
E4 C20 C20  0.66
E5 C20 C20  0.50        0.40 -0.32
N1 C16 C16        0.76
N2 C16 C16        0.75
N3 C16 C16        0.74
N4 C16 C16 -0.34  0.62
N5 C16 C16        0.55
O1 C20 C21                   -0.53
O2 C21 C21                    0.44
O3 C20 C21  0.39             -0.62
O4 C21 C21                   -0.33
O5 C21 C21                    0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit = 0.68   Pattern fit = 0.96   RMSR = 0.05
NULL


Table 8: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.

> fa(bfi[1:10],2,n.iter=20)
Factor Analysis with confidence intervals using method =  fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method = minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

 With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with MLE Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.078 and the 90% confidence intervals are 0.071 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

 Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.46 -0.40 -0.36
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.06 -0.03  0.00  0.74  0.77  0.80
A4  0.10  0.15  0.18  0.41  0.44  0.48
A5 -0.02  0.02  0.07  0.58  0.62  0.66
C1  0.52  0.57  0.60 -0.08 -0.05 -0.02
C2  0.59  0.62  0.66 -0.04 -0.01  0.02
C3  0.50  0.54  0.57 -0.01  0.03  0.08
C4 -0.70 -0.66 -0.62 -0.02  0.01  0.03
C5 -0.62 -0.57 -0.53 -0.08 -0.05 -0.01

 Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.28     0.32  0.37
>


PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90

A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor, and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 9).

Table 9: Congruence coefficients for oblique factor, bifactor, and component solutions for the Thurstone problem.

> factor.congruence(list(f3t,f3o,om,p3p))
     MR1  MR2  MR3   PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2  RC3
MR1 1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
MR2 0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
MR3 0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
PA1 1.00 0.03 0.01  1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2 0.06 1.00 0.01  0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99 0.05
PA3 0.13 0.06 1.00  0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01 0.99
g   0.72 0.60 0.52  0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58 0.50
F1  1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
F2  0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
F3  0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
h2  0.74 0.57 0.51  0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56 0.49
RC1 1.00 0.04 0.06  1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06 0.00
RC2 0.08 0.99 0.02  0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00 0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05 1.00

5.4 Determining the number of dimensions to extract

How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).

Techniques most commonly used include:

1) Extracting factors until the chi square of the residual matrix is not significant.

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis) fa.parallel (Horn, 1965).

4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).

5) Extracting factors as long as they are interpretable.

6) Using the Very Simple Structure Criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.

Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples (N > 1,000) all of the eigen values of random factors will be very close to 1. The scree test is quite appealing, but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigators creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.

An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest that nonetheless account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
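A sketch of that simulation logic, assuming sim.minor's defaults (small loadings of roughly ±.2 on the minor factors) and hypothetical sizes:

> set.seed(42)
> sim <- sim.minor(nvar = 24, nfact = 4, n = 400)   # 4 major factors plus trivial minor factors
> fa.parallel(sim$observed)    # does the criterion recover just the 4 major factors?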

5.4.1 Very Simple Structure

The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Absolute Partial correlation) of Velicer (1976).

Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 12). However, the MAP criterion suggests that 5 is optimal.


> vss <- vss(bfi[1:25],title="Very Simple Structure of a Big 5 inventory")

[Plot: "Very Simple Structure of a Big 5 inventory" — Very Simple Structure Fit (0.0 to 1.0) against the Number of Factors (1 to 8), with one curve for each level of item complexity (1 to 4)]

Figure 12: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.


> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximum of 0.58 with 4 factors
VSS complexity 2 achieves a maximum of 0.74 with 4 factors

The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit RMSEA  BIC   SABIC complex eChisq  SRMR eCRMS  eBIC
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.123 9648 10522.1     1.0  23881 0.119 0.125 21698
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.100 5287  6084.5     1.2  12432 0.086 0.094 10440
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.087 3200  3924.3     1.3   7232 0.066 0.075  5422
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.074 1731  2385.1     1.4   3750 0.047 0.057  2115
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.055  281   869.3     1.6   1495 0.030 0.038    27
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.043 -296   228.5     1.7    670 0.020 0.027  -639
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.037 -463     1.2     1.9    448 0.016 0.023  -711
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.032 -524  -117.6     1.9    289 0.013 0.020  -727

5.4.2 Parallel Analysis

An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 13). It is interesting to compare fa.parallel with the paran from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial, minor (but not zero) factors, show that using smcs will tend to lead to too many factors.

A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly or fa.parallel with the cor option="poly". By default, the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
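For example (a minimal sketch; 20 iterations is the default):

> fa.parallel(bfi[1:25], cor = "poly", n.iter = 20)   # parallel analysis of polychoric correlations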

5.5 Factor extension

Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created


> fa.parallel(bfi[1:25],main="Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors = 6 and the number of components = 6

[Scree plot: "Parallel Analysis of a Big 5 inventory" — eigen values of principal components and factor analysis for the actual, simulated, and resampled data]

Figure 13: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.


to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 14).

Another way to examine the overlap between two sets is the use of set correlation found by set.cor (discussed later).

6 Classical Test Theory and Reliability

Surprisingly, 112 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and over estimates the first factor saturation (Revelle and Zinbarg, 2009).

α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

$$\lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(V_x)}{V_x}\right) = \frac{n}{n-1}\,\frac{V_x - tr(V_x)}{V_x} = \alpha$$

Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. For τ equivalent items, alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy"), coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.

Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, $e_j^2$, and is

$$\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum (1 - r_{smc}^2)}{V_x}$$

The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.
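The smc function reports these estimates directly; a minimal sketch using the Thurstone correlation matrix:

> round(smc(Thurstone), 2)   # squared multiple correlations, lower bounds for the communalities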

G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal, or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability


> v16 <- sim.item(16)
> s <- c(1,3,5,7,9,11,13,15)
> f2 <- fa(v16[,s],2)
> fe <- fa.extension(cor(v16)[s,-s],f2)
> fa.diagram(f2,fe=fe)

[Diagram: "Factor analysis and extension" — MR1 and MR2 are defined by the eight odd numbered variables (left) and extended to the eight even numbered variables (right)]

Figure 14: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.


the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.

Alpha, G6 and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.

More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.

Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.

Functions for examining the reliability of a single scale or a set of scales include:

alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item - whole correlations, α if an item is omitted, and item means and standard deviations.

guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zergers (1978) (μ0 ... μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).

omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)

cluster.cor Given a n x c cluster definition matrix of -1s, 0s, and 1s (the keys), and a n x n correlation matrix, find the correlations of the composite clusters.

scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of item scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha coefficient, G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.

score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.
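As a minimal sketch of the multiple scale case (assuming the bfi item names; the agree and conscientious keys below are illustrative, with negatively keyed items marked by a minus sign):

> keys.list <- list(agree = c("-A1","A2","A3","A4","A5"),
+                   conscientious = c("C1","C2","C3","-C4","-C5"))
> keys <- make.keys(bfi, keys.list)   # convert to a -1/0/1 keying matrix
> scores <- scoreItems(keys, bfi)     # alpha, G6, average r, and scale scores
> scores$alpha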

6.1 Reliability of a single scale

A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, and McDonald's ωh and ωt. Consider a simulated data set representing 9 items with a hierarchical structure and the following correlation matrix. Then using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as the α if one item is dropped at a time.

> set.seed(17)
> r9 <- sim.hierarchical(n=500,raw=TRUE)$observed
> round(cor(r9),2)

     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00

> alpha(r9)

Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
0.77 0.8 0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97

Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reversed scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that item 2 is to be scored (incorrectly) negatively.

> keys <- c(1,-1,1,1,1,1,1)
> alpha(attitude,keys)

Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
       0.43      0.52    0.75      0.14 1.1 0.13   58 5.5

 lower alpha upper     95% confidence boundaries
0.18 0.43 0.68

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r  S/N alpha se
rating           0.32      0.44    0.67     0.114 0.77    0.170
complaints-      0.80      0.80    0.82     0.394 3.89    0.057
privileges       0.27      0.41    0.72     0.103 0.69    0.176
learning         0.14      0.31    0.64     0.069 0.44    0.207
raises           0.14      0.27    0.61     0.059 0.38    0.205
critical         0.36      0.47    0.76     0.130 0.90    0.139
advance          0.21      0.34    0.66     0.079 0.51    0.171

 Item statistics
             n raw.r std.r r.cor r.drop mean   sd
rating      30  0.60  0.60  0.60   0.33   65 12.2
complaints- 30 -0.57 -0.58 -0.74  -0.74   50 13.3
privileges  30  0.66  0.65  0.54   0.42   53 12.2
learning    30  0.80  0.79  0.78   0.64   56 11.7
raises      30  0.82  0.83  0.85   0.69   65 10.4
critical    30  0.50  0.53  0.35   0.27   75  9.9
advance     30  0.74  0.75  0.71   0.58   43 10.3

Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if item 2 is dropped, then the reliability is improved substantially. This suggests that item 2 was incorrectly scored. Doing the analysis again with item 2 positively scored produces much more favorable results.

> keys <- c(1,1,1,1,1,1,1)
> alpha(attitude,keys)

Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
0.76 0.84 0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3

It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.

> items <- sim.congeneric(N=500,short=FALSE,low=-2,high=2,categorical=TRUE)  # 500 responses to 4 discrete items
> alpha(items$observed)  # item response analysis of congeneric measures

Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
0.68 0.72 0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02

Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0

6.2 Using omega to find the reliability of a single scale

Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 15) or omegaSem for a confirmatory analysis using the sem package (Fox et al., 2013) based upon the exploratory solution from omega.

McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) (http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf) compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009), http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf.)

One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.
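Those steps may also be done by hand; a minimal sketch using the r9 data simulated above (schmid performs the Schmid-Leiman transformation):

> sl <- schmid(cor(r9), 3)   # oblique factors, then general factor loadings
> round(sl$sl, 2)            # g and group factor loadings, h2, u2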

ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal). For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items, it is frequently the case that some items are to be scored positively, and some negatively. Although probably better to specify which directions the items are to be scored by specifying a key vector, if flip = TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)

β, an alternative to ω, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).

The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).

Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default, this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.

There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be the $\sqrt{r_{ab}}$ where $r_{ab}$ is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. 2007. To do this in omega, add the option="first" or option="second" to the call.
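For example, a sketch of a forced two factor solution using the r9 data (option may be "equal", "first", or "second"):

> om2 <- omega(r9, 2, option = "first")   # equate the general factor with the first group factor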

Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.

In addition to ωh another of McDonaldrsquos coefficients is ωt This is an estimate of the totalreliability of a test

McDonaldrsquos ωt which is similar to Guttmanrsquos λ6 (see guttman) uses the estimates ofuniqueness u2 from factor analysis to find e2

j This is based on a decomposition of thevariance of a test score Vx into four parts that due to a general factor g that due toa set of group factors f (factors common to some but not all of the items) specificfactors s unique to each item and e random error (Because specific variance can not bedistinguished from random error unless the test is given at least twice some combine theseboth into error)

Letting x = cg + A f + Ds + e then the communality of item j based upon general as well asgroup factors h2

j = c2j + sum f 2

i j and the unique variance for the item u2j = σ2

j (1minush2j) may be

used to estimate the test reliability That is if h2j is the communality of item j based upon

general as well as group factors then for standardized items e2j = 1minush2

j and

ωt =1ccprime1 + 1AAprime1prime

Vx= 1minus

sum(1minush2j)

Vx= 1minus sumu2

Vx

49

Because h2j ge r2

smc ωt ge λ6

It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor:

$$\omega_h = \frac{\mathbf{1}cc'\mathbf{1}}{V_x}$$

Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

$$\omega_{\inf} = \frac{\mathbf{1}cc'\mathbf{1}}{\mathbf{1}cc'\mathbf{1} + \mathbf{1}AA'\mathbf{1}'}$$

In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.

gt om9

9 simulated variablesCall omega(m = r9 title = 9 simulated variables)Alpha 08G6 08Omega Hierarchical 069Omega H asymptotic 082Omega Total 083

Schmid Leiman Factor loadings greater than  0.2 
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57           0.41 0.50 0.50 0.66
V5 0.55           0.32 0.41 0.59 0.73
V6 0.43           0.27 0.28 0.72 0.66
V7 0.47      0.47      0.44 0.56 0.50
V8 0.43      0.42      0.36 0.64 0.51
V9 0.32      0.23      0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3* 
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13 
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03 
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02 
The df corrected root mean square of the residuals is  0.03

50

> om9 <- omega(r9, title="9 simulated variables")

[Bifactor path diagram: the nine variables, their loadings on the general factor g, and the three group factors F1-F3; see the caption below.]

Figure 15: A bifactor solution for 9 simulated variables with a hierarchical structure.


RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors.
The degrees of freedom for just the general factor are 27  and the fit is  0.32 
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08 
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97

Measures of factor score adequacy 
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g   F1*   F2*   F3*
Omega total for total scores and subscales    0.83  0.79  0.56  0.65
Omega general for total scores and subscales  0.69  0.55  0.30  0.46
Omega group for total scores and subscales    0.12  0.24  0.26  0.19

6.3 Estimating ωh using Confirmatory Factor Analysis

The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.

> omegaSem(r9, n.obs=500)

Call: omegaSem(m = r9, n.obs = 500)
Omega 
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip, 
    digits = digits, title = title, sl = sl, labels = labels, 
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)

Alpha:                 0.8 
G.6:                   0.8 
Omega Hierarchical:    0.69 
Omega H asymptotic:    0.82 
Omega Total            0.83

Schmid Leiman Factor loadings greater than  0.2 
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57           0.41 0.50 0.50 0.66
V5 0.55           0.32 0.41 0.59 0.73
V6 0.43           0.27 0.28 0.72 0.66
V7 0.47      0.47      0.44 0.56 0.50
V8 0.43      0.42      0.36 0.64 0.51
V9 0.32      0.23      0.16 0.84 0.63


With eigenvalues of:
   g  F1*  F2*  F3* 
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13 
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03 
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02 
The df corrected root mean square of the residuals is  0.03
RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors.
The degrees of freedom for just the general factor are 27  and the fit is  0.32 
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08 
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97

Measures of factor score adequacy 
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g   F1*   F2*   F3*
Omega total for total scores and subscales    0.83  0.79  0.56  0.65
Omega general for total scores and subscales  0.69  0.55  0.30  0.46
Omega group for total scores and subscales    0.12  0.24  0.26  0.19

Omega Hierarchical from a confirmatory model using sem =  0.72
Omega Total from a confirmatory model using sem =  0.83 
With loadings of 
      g  F1*  F2*  F3*   h2   u2
V1 0.74 0.40           0.71 0.29
V2 0.58 0.36           0.47 0.53
V3 0.56 0.42           0.50 0.50
V4 0.57           0.45 0.53 0.47
V5 0.58           0.25 0.40 0.60
V6 0.43           0.26 0.26 0.74
V7 0.49      0.38      0.38 0.62
V8 0.44      0.45      0.39 0.61
V9 0.34      0.24      0.17 0.83

With eigenvalues of:
   g  F1*  F2*  F3* 
2.60 0.47 0.40 0.33


6.3.1 Other estimates of reliability

Other estimates of reliability are found by the splitHalf function. These are described in more detail in Revelle and Zinbarg (2009). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.

> splitHalf(r9)

Split half reliabilities  
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) =  0.84
Guttman lambda 6                          =  0.8
Average split half reliability            =  0.79
Guttman lambda 3 (alpha)                  =  0.8
Minimum split half reliability (beta)     =  0.72

6.4 Reliability and correlations of multiple scales within an inventory

A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness and Openness). The data may either be the raw data or a correlation matrix (scoreItems) or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6.

6.4.1 Scoring from raw data

To score these five scales from the 25 items, use the scoreItems function with the helper function make.keys. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), and -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.
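A one line illustration of the reverse scoring arithmetic (the 1 to 6 response scale is assumed here purely for the example):

> item <- c(1, 2, 6)    #responses on a 1 to 6 scale
> (6 + 1) - item        #reverse scored: 6 5 1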

There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.

The printed output includes coefficients α and G6, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.

First, create a keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copying it into R.) Although not normally necessary, show the keys to understand what is happening.

Note that the number of items to specify in the make.keys function is the total number of items in the inventory. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.

> keys <- make.keys(nvars=28, list(Agree=c(-1,2:5), Conscientious=c(6:8,-9,-10),
+    Extraversion=c(-11,-12,13:15), Neuroticism=c(16:20),
+    Openness = c(21,-22,23,24,-25)),
+    item.labels=colnames(bfi))
> keys

          Agree Conscientious Extraversion Neuroticism Openness
A1           -1             0            0           0        0
A2            1             0            0           0        0
A3            1             0            0           0        0
A4            1             0            0           0        0
A5            1             0            0           0        0
C1            0             1            0           0        0
C2            0             1            0           0        0
C3            0             1            0           0        0
C4            0            -1            0           0        0
C5            0            -1            0           0        0
E1            0             0           -1           0        0
E2            0             0           -1           0        0
E3            0             0            1           0        0
E4            0             0            1           0        0
E5            0             0            1           0        0
N1            0             0            0           1        0
N2            0             0            0           1        0
N3            0             0            0           1        0
N4            0             0            0           1        0
N5            0             0            0           1        0
O1            0             0            0           0        1
O2            0             0            0           0       -1
O3            0             0            0           0        1
O4            0             0            0           0        1
O5            0             0            0           0       -1
gender        0             0            0           0        0
education     0             0            0           0        0
age           0             0            0           0        0

The use of multiple key matrices for different inventories is facilitated by using the superMatrix function to combine two or more matrices. This allows convenient scoring of large data sets combining multiple inventories, with keys based upon each individual inventory. Pretend for the moment that the big 5 items were made up of two inventories, one consisting of the first 10 items, the second the last 18 items (15 personality items + 3 demographic items). Then the following code would work:

> keys.1 <- make.keys(10, list(Agree=c(-1,2:5), Conscientious=c(6:8,-9,-10)))
> keys.2 <- make.keys(15, list(Extraversion=c(-1,-2,3:5), Neuroticism=c(6:10),
+    Openness = c(11,-12,13,14,-15)))
> keys.25 <- superMatrix(list(keys.1, keys.2))

The resulting keys matrix is identical to that found above, except that it does not include the extra 3 demographic items. This is useful when scoring the raw items, because the response frequencies for each category are reported, and these are not meaningful for the demographic data.

This use of making multiple key matrices and then combining them into one super matrix of keys is particularly useful when combining demographic information with items to be scored. A set of demographic keys can be made and then these can be combined with the keys for the particular scales, as in the sketch below.
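A hedged sketch of that idea (the demographic key names here are hypothetical, not taken from the examples above):

> keys.demo <- make.keys(3, list(gender=1, education=2, age=3))  #hypothetical demographic keys
> keys.all <- superMatrix(list(keys.25, keys.demo))              #combine with the scale keys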

Now use these keys in combination with the raw data to score the items, calculate basic reliability and intercorrelations, and find the item-by-scale correlations for each item and each scale. By default, missing data are replaced by the median for that variable.

> scores <- scoreItems(keys, bfi)
> scores

Call: scoreItems(keys = keys, items = bfi)

(Unstandardized) Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
alpha   0.7          0.72         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
    Agree Conscientious Extraversion Neuroticism Openness
ASE 0.014         0.014        0.013       0.011    0.017

Average item correlation:
          Agree Conscientious Extraversion Neuroticism Openness
average.r  0.32          0.34         0.39        0.46     0.23

Guttman 6* reliability:
         Agree Conscientious Extraversion Neuroticism Openness
Lambda.6   0.7          0.72         0.76        0.81      0.6

Signal/Noise based upon av.r:
             Agree Conscientious Extraversion Neuroticism Openness
Signal/Noise   2.3           2.6          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation 
raw correlations below the diagonal, alpha on the diagonal 
corrected correlations above the diagonal:

              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.70          0.36         0.63      -0.245     0.23
Conscientious  0.26          0.72         0.35      -0.305     0.30
Extraversion   0.46          0.26         0.76      -0.284     0.32
Neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
Openness       0.15          0.19         0.22      -0.086     0.60

In order to see the item by scale loadings and frequency counts of the data, 
print with the short option = FALSE.

To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores element of the scores object. (See Figure 16.)

6.4.2 Forming scales from a correlation matrix

There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.

Consider the same bfi data set, but first find the correlations, and then use cluster.cor.

> r.bfi <- cor(bfi, use="pairwise")
> scales <- cluster.cor(keys, r.bfi)
> summary(scales)

Call: cluster.cor(keys = keys, r.mat = r.bfi)

Scale intercorrelations corrected for attenuation 
raw correlations below the diagonal, (standardized) alpha on the diagonal 
corrected correlations above the diagonal:

              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.71          0.35         0.64      -0.242     0.25
Conscientious  0.25          0.73         0.36      -0.287     0.30
Extraversion   0.47          0.27         0.76      -0.278     0.35
Neuroticism   -0.18         -0.22        -0.22       0.815    -0.11
Openness       0.16          0.20         0.24      -0.074     0.61

To find the correlations of the items with each of the scales (the "structure" matrix) or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.


> png('scores.png')
> pairs.panels(scores$scores, pch='.', jiggle=TRUE)
> dev.off()

Figure 16: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.


6.5 Scoring Multiple Choice Items

Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice, where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors), and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).

> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys, iqitems)

Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

item statistics 
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39

> #just convert the items to true or false 
> iq.tf <- score.multiple.choice(iq.keys, iqitems, score=FALSE)
> describe(iq.tf)  #compare to previous results

          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01

Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function, as in the sketch below.
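A hedged sketch of doing so, grouping the items by their content labels (this four scale grouping is an assumption made for the illustration, not a validated key):

> tf.keys <- make.keys(16, list(reason=1:4, letter=5:8, matrix=9:12, rotate=13:16),
+     item.labels=colnames(iq.tf))
> tf.scales <- scoreItems(tf.keys, iq.tf)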

6.6 Item analysis

Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data, or based upon correlation matrices using the setCor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches. A compact sketch of this work flow appears below.
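A compact sketch of that work flow, applied to the bfi items with the keys defined in section 6.4:

> describe(bfi[1:25])               #basic descriptive statistics
> fa.parallel(bfi[1:25])            #how many dimensions?
> f5 <- fa(bfi[1:25], 5)            #extract five factors
> scales <- scoreItems(keys, bfi)   #reliabilities and item by scale correlations at once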

Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 17).

7 Item Response Theory analysis

The use of Item Response Theory has grown so much that it is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty (or location) and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 18).


> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys, iqitems, score=TRUE, short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  #set this to see the output for multiple items
> irt.responses(scores$scores, iqitems[1:4], breaks=11)

[Four panels of response curves (reason.4, reason.16, reason.17, reason.19): P(response) as a function of theta for each response alternative 0-6.]

Figure 17: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.


7.1 Factor analysis and Item Response Theory

If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δ_j, and item discrimination, α_j, may be found from factor analysis of the tetrachoric correlations, where λ_j is just the factor loading on the first factor and τ_j is the normal threshold reported by the tetrachoric function:

δ_j = D τ_j / √(1 − λ²_j),        α_j = λ_j / √(1 − λ²_j)        (2)

where D is a scaling factor used when converting to the parameterization of the logistic model, and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λ_j) and item thresholds (τ_j) are just

λ_j = α_j / √(1 + α²_j),        τ_j = δ_j / √(1 + α²_j)
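These conversions are easy to verify numerically; the loading and threshold values below are made up for the illustration (normal metric, D = 1):

> lambda <- 0.6                        #a factor loading
> tau <- 0.3                           #its threshold
> alpha <- lambda/sqrt(1 - lambda^2)   #discrimination: 0.75
> delta <- tau/sqrt(1 - lambda^2)      #difficulty: 0.375
> alpha/sqrt(1 + alpha^2)              #recovers lambda = 0.6
> delta/sqrt(1 + alpha^2)              #recovers tau = 0.3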

Consider 9 dichotomous items representing one factor but differing in their levels of difficulty.

> set.seed(17)
> d9 <- sim.irt(9, 1000, -2.5, 2.5, mod="normal")  #dichotomous items
> test <- irt.fa(d9$items)

> test

Item Response Analysis using Factor Analysis 

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis 

Summary information by factor and item
 Factor =  1 
               -3    -2    -1     0     1     2     3
V1           0.48  0.59  0.24  0.06  0.01  0.00  0.00
V2           0.30  0.68  0.45  0.12  0.02  0.00  0.00
V3           0.12  0.50  0.74  0.29  0.06  0.01  0.00
V4           0.05  0.26  0.74  0.55  0.14  0.03  0.00
V5           0.01  0.07  0.44  1.05  0.41  0.06  0.01
V6           0.00  0.03  0.15  0.60  0.74  0.24  0.04
V7           0.00  0.01  0.04  0.22  0.73  0.63  0.16
V8           0.00  0.00  0.02  0.12  0.45  0.69  0.31
V9           0.00  0.01  0.02  0.08  0.25  0.47  0.36
Test Info    0.98  2.14  2.85  3.09  2.81  2.14  0.89
SEM          1.01  0.68  0.59  0.57  0.60  0.68  1.06
Reliability -0.02  0.53  0.65  0.68  0.64  0.53 -0.12


Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27  and the objective function was  1.2
The number of observations was  1000  with Chi Square =  1195.58  with prob <  7.3e-235

The root mean square of the residuals (RMSA) is  0.09 
The df corrected root mean square of the residuals is  0.1 

Tucker Lewis Index of factoring reliability =  0.699
RMSEA index =  0.209  and the 90 % confidence intervals are  0.198 0.218
BIC =  1009.07

Similar analyses can be done for polytomous items, such as those of the bfi extraversion scale:

> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt

Item Response Analysis using Factor Analysis 

Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis 

Summary information by factor and item
 Factor =  1 
               -3    -2    -1     0     1     2     3
E1           0.37  0.58  0.75  0.76  0.61  0.39  0.21
E2           0.48  0.89  1.18  1.24  0.99  0.59  0.25
E3           0.34  0.49  0.58  0.57  0.49  0.36  0.23
E4           0.63  0.98  1.11  0.96  0.67  0.37  0.16
E5           0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info    2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM          0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability  0.53  0.70  0.76  0.75  0.68  0.49  0.02

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5  and the objective function was  0.05
The number of observations was  2800  with Chi Square =  135.92  with prob <  1.3e-27

The root mean square of the residuals (RMSA) is  0.04 
The df corrected root mean square of the residuals is  0.06 

Tucker Lewis Index of factoring reliability =  0.932
RMSEA index =  0.097  and the 90 % confidence intervals are  0.083 0.111
BIC =  96.23

The item information functions show that not all of the items are equally good (Figure 19):



> op <- par(mfrow=c(3,1))
> plot(test, type="ICC")
> plot(test, type="IIC")
> plot(test, type="test")
> op <- par(mfrow=c(1,1))

[Three panels plotted against the latent trait (normal scale): "Item parameters from factor analysis" (probability of response for V1-V9), "Item information from factor analysis", and "Test information -- item parameters from factor analysis" (with the standard error of measurement and reliability).]

Figure 18: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.


> e.info <- plot(e.irt, type="IIC")

[Item information curves for -E1, -E2, E3, E4, and E5 as a function of the latent trait (normal scale).]

Figure 19: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.


These procedures can be generalized to more than one factor by specifying the number of factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.

> print(e.info, sort=TRUE)

Item Response Analysis using Factor Analysis 

Summary information by factor and item
 Factor =  1 
               -3    -2    -1     0     1     2     3
E2           0.48  0.89  1.18  1.24  0.99  0.59  0.25
E4           0.63  0.98  1.11  0.96  0.67  0.37  0.16
E1           0.37  0.58  0.75  0.76  0.61  0.39  0.21
E3           0.34  0.49  0.58  0.57  0.49  0.36  0.23
E5           0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info    2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM          0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability  0.53  0.70  0.76  0.75  0.68  0.49  0.02

More extensive IRT packages include ltm and eRm, and these should be used for serious Item Response Theory analysis.

7.2 Speeding up analyses

Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis, as in the sketch below.
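In outline (using the iq.tf true/false scores formed earlier; irt.fa returns its tetrachoric matrix as the rho element of its output):

> iq.irt <- irt.fa(iq.tf)       #slow: the tetrachorics are computed here
> om <- omega(iq.irt$rho, 4)    #fast: reuses the saved correlation matrix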

For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 20) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items (Figure 21).

> iq.irt

Item Response Analysis using Factor Analysis 

Call: irt.fa(x = iq.tf)
Item Response Analysis using Factor Analysis 

Summary information by factor and item
 Factor =  1 
               -3    -2    -1     0     1     2     3
reason.4     0.04  0.20  0.59  0.59  0.19  0.04  0.01
reason.16    0.07  0.23  0.46  0.39  0.16  0.05  0.01
reason.17    0.06  0.27  0.69  0.50  0.14  0.03  0.01
reason.19    0.05  0.17  0.41  0.46  0.22  0.07  0.02


> iq.irt <- irt.fa(iq.tf)

[Item information curves for the 16 ability items (reason.4 through rotate.8) as a function of the latent trait (normal scale).]

Figure 20: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.


letter.7     0.04  0.16  0.44  0.52  0.23  0.06  0.02
letter.33    0.04  0.14  0.34  0.43  0.24  0.08  0.02
letter.34    0.04  0.17  0.48  0.54  0.22  0.06  0.01
letter.58    0.02  0.08  0.28  0.55  0.39  0.13  0.03
matrix.45    0.05  0.11  0.23  0.29  0.21  0.10  0.04
matrix.46    0.05  0.12  0.25  0.30  0.21  0.10  0.04
matrix.47    0.05  0.16  0.37  0.41  0.21  0.07  0.02
matrix.55    0.04  0.08  0.14  0.20  0.19  0.13  0.06
rotate.3     0.00  0.01  0.07  0.29  0.66  0.45  0.13
rotate.4     0.00  0.01  0.05  0.30  0.91  0.55  0.11
rotate.6     0.01  0.03  0.14  0.49  0.68  0.27  0.06
rotate.8     0.00  0.02  0.08  0.27  0.54  0.39  0.13
Test Info    0.55  1.95  4.99  6.53  5.41  2.57  0.71
SEM          1.34  0.72  0.45  0.39  0.43  0.62  1.18
Reliability -0.80  0.49  0.80  0.85  0.82  0.61 -0.40

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104  and the objective function was  1.85
The number of observations was  1525  with Chi Square =  2801.05  with prob <  0

The root mean square of the residuals (RMSA) is  0.08 
The df corrected root mean square of the residuals is  0.09 

Tucker Lewis Index of factoring reliability =  0.742
RMSEA index =  0.131  and the 90 % confidence intervals are  0.126 0.135
BIC =  2038.76

7.3 IRT based scoring

The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for score.irt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items per person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The score.irt function applies basic IRT to this problem.

Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects:

> v9 <- sim.irt(9, 1000, -2., 2., mod="normal")  #dichotomous items
> items <- v9$items
> test <- irt.fa(items)


> om <- omega(iq.irt$rho, 4)

[Omega (bifactor) diagram: the 16 ability items, their loadings on the general factor g, and four group factors F1-F4.]

Figure 21: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when examining the item response functions.


> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> #now delete some of the data - note that they are ordered by score
> items[1:333, 5:9] <- NA
> items[334:666, 3:7] <- NA
> items[667:1000, 1:4] <- NA
> scores <- score.irt(test, items)
> unitweighted <- score.irt(items=items, keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord], scores, unitweighted)
> colnames(scores.df) <- c("True theta", "irt theta", "total", "fit", "rasch", "total", "fit")

These results are seen in Figure 22.

8 Multilevel modeling

Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (overall descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.

8.1 Decomposing data into within and between level correlations using statsBy

There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.

This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:

r_xy = η_xwg * η_ywg * r_xywg + η_xbg * η_ybg * r_xybg        (3)


Figure 22: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.
> pairs.panels(scores.df, pch='.', gap=0)
> title('Comparing true theta for IRT, Rasch and classically based scoring', line=3)

[Scatter plot matrix (pairs.panels) comparing the true theta values with the IRT, total score, Rasch, and unit weighted estimates and their fit statistics.]


where r_xy is the normal correlation, which may be decomposed into a within group and a between group correlation, r_xywg and r_xybg, and η (eta) is the correlation of the data with the within group values or the group means.

8.2 Generating and displaying multilevel data

withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5, and V8, and V3, V6, and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6, and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5, and V6, and V7, V8, and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.

sim.multilevel will generate simulated data with a multilevel structure.

The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.

Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)) or when affect data are analyzed within and between an affect manipulation (statsBy(affect)), as in the sketch below.
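A minimal sketch of the first of these (sat.act is included with psych; the rwg and rbg elements of the output hold the pooled within group and the between group correlations):

> sb <- statsBy(sat.act, group="education")
> round(sb$rwg, 2)    #pooled within-education correlations
> round(sb$rbg, 2)    #correlations of the education group means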

9 Set Correlation and Multiple Regression from the correlation matrix

An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is

R² = 1 − ∏_{i=1}^{n} (1 − λ_i)


where λ_i is the ith eigenvalue of the eigenvalue decomposition of the matrix

R = R_xx⁻¹ R_xy R_yy⁻¹ R_yx
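A sketch of this arithmetic, assuming x is taken as the first three and y as the last three variables of the sat.act correlation matrix:

> R <- cor(sat.act, use="pairwise")
> Rxx <- R[1:3, 1:3]; Ryy <- R[4:6, 4:6]; Rxy <- R[1:3, 4:6]
> M <- solve(Rxx) %*% Rxy %*% solve(Ryy) %*% t(Rxy)
> 1 - prod(1 - Re(eigen(M)$values))    #Cohen's set correlation R2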

Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic, based upon the average canonical correlation, might be more appropriate.

setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.

Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example, the analysis is done on the correlation matrix rather than the raw data.

> C <- cov(sat.act, use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)

Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max 
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110    
education    0.47890    0.15235   3.143  0.00174 ** 
age          0.01623    0.02278   0.712  0.47650    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared:  0.0272,    Adjusted R-squared:  0.02301 
F-statistic: 6.487 on 3 and 696 DF,  p-value: 0.0002476

Compare this with the output from setCor:

> #compare with mat.regress
> setCor(c(4:6), c(1:3), C, n.obs=700)


Call: setCor(y = y, x = x, data = data, z = z, n.obs = n.obs, use = use, 
    std = std, square = square, main = main, plot = plot)

Multiple Regression from matrix input

Beta weights 
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R 
 ACT SATV SATQ 
0.16 0.10 0.19 

multiple R2 
   ACT   SATV   SATQ 
0.0272 0.0096 0.0359

Unweighted multiple R 
 ACT SATV SATQ 
0.15 0.05 0.11 
Unweighted multiple R2 
 ACT SATV SATQ 
0.02 0.00 0.01

SE of Beta weights 
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights 
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t < 
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2 
   ACT   SATV   SATQ 
0.0230 0.0054 0.0317

Standard Error of R2 
   ACT   SATV   SATQ 
0.0120 0.0073 0.0137

F 
 ACT SATV SATQ 
6.49 2.26 8.63

Probability of F < 
     ACT     SATV     SATQ 
2.48e-04 8.08e-02 1.24e-05


degrees of freedom of regression 
[1]   3 696

Various estimates of between set correlations
Squared Canonical Correlations 
[1] 0.050 0.033 0.008
Chisq of canonical correlations 
[1] 35.8 23.1  5.6

Average squared canonical correlation =  0.03 
Cohen's Set Correlation R2 =  0.09 
Shrunken Set Correlation R2 =  0.08 
F and df of Cohen's Set Correlation  7.26 9 1681.86 
Unweighted correlation between the two sets =  0.01

Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R² is the same independent of the direction of the relationship.

10 Simulation functions

It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth", it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.

A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex and sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item (a more general item simulation), sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated Rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural (a general simulation of structural models), sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.

sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.simple Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.


sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.

Some of these functions are described in more detail in the companion vignette, psych for sem.

The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.

Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.

Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals. A brief sketch appears below.
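A brief sketch of generating and then trying to recover such a structure (the seed and sample size are arbitrary choices, and the observed element is assumed to hold the simulated raw data):

> set.seed(42)
> minor <- sim.minor(nvar=12, nfact=3, n=500)   #3 major and 6 minor factors
> fa.parallel(minor$observed)                   #can the 3 major factors be recovered?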

Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.


The four IRT simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.

The logistic model is

P(x | θ_i, δ_j, γ_j, ζ_j) = γ_j + (ζ_j − γ_j) / (1 + e^(α_j(δ_j − θ_i)))        (4)

where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), α_j is item discrimination, and δ_j is item difficulty. For the 1 Parameter Logistic (Rasch) model, gamma=0, zeta=1, alpha=1, and item difficulty is the only free parameter to specify.
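A direct transcription of Equation 4 (the function name p.item is made up for this sketch):

> p.item <- function(theta, alpha=1, delta=0, gamma=0, zeta=1) {
+    gamma + (zeta - gamma)/(1 + exp(alpha*(delta - theta)))
+ }
> p.item(theta=0)              #0.5: an item is passed half the time at theta = delta
> p.item(theta=0, gamma=.2)    #guessing raises the lower asymptote: 0.6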

(Graphics of these may be seen in the demonstrations for the logistic function.)

The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same. With the α parameter = 1.702 in the logistic model, the two models are practically identical.

11 Graphical Displays

Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as specific diagram functions, e.g. (but not shown):

f3 <- fa(Thurstone, 3)
plot(f3)
fa.diagram(f3)
c <- iclust(Thurstone)
plot(c)            #a pretty boring plot
iclust.diagram(c)  #a better diagram
c3 <- iclust(Thurstone, 3)
plot(c3)           #a more interesting plot
data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)
ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)

The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.


> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA, xlim=xlim, ylim=ylim, main="Demonstration of dia functions", axes=FALSE, xlab="", ylab="")
> ul <- dia.rect(1,9, labels="upper left", xlim=xlim, ylim=ylim)
> ll <- dia.rect(1,3, labels="lower left", xlim=xlim, ylim=ylim)
> lr <- dia.ellipse(9,3, "lower right", xlim=xlim, ylim=ylim)
> ur <- dia.ellipse(9,9, "upper right", xlim=xlim, ylim=ylim)
> ml <- dia.ellipse(3,6, "middle left", xlim=xlim, ylim=ylim)
> mr <- dia.ellipse(7,6, "middle right", xlim=xlim, ylim=ylim)
> bl <- dia.ellipse(1,1, "bottom left", xlim=xlim, ylim=ylim)
> br <- dia.rect(9,1, "bottom right", xlim=xlim, ylim=ylim)
> dia.arrow(from=lr, to=ul, labels="right to left")
> dia.arrow(from=ul, to=ur, labels="left to right")
> dia.curved.arrow(from=lr, to=ll$right, labels="right to left")
> dia.curved.arrow(to=ur, from=ul$right, labels="left to right")
> dia.curve(ll$top, ul$bottom, "double")  #for rectangles, specify where to point
> dia.curved.arrow(mr, ur, "up")          #but for ellipses, just point to it
> dia.curve(ml, mr, "across")
> dia.arrow(ur, lr, "top down")
> dia.curved.arrow(br$top, lr$bottom, "up")
> dia.curved.arrow(bl, br, "left to right")
> dia.arrow(bl, ll$bottom)
> dia.curved.arrow(ml, ll$right)
> dia.curved.arrow(mr, lr$top)


Figure 23: The basic graphic capabilities of the dia functions are shown in this figure.

12 Miscellaneous functions

A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.

block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score.

geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headtail. Combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r and estimates effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also setCor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.


superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys, or when forming example problems.

13 Data sets

A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941) and reanalyzed by Bechtoldt (1961). Personality item data represent five personality factors on 25 items (bfi) or 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.

Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.


iq 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton height. peas is the data set Francis Galton used to introduce the correlation coefficient with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1994), this data set allows for examples of basic scaling techniques.

14 Development version and a users guide

The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered ones). To download that version, go to the repository http://personality-project.org/r/src/contrib and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.

News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news,

> news(Version > "1.2.8", package="psych")


15 Psychometric Theory

The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book): An introduction to Psychometric Theory with Applications in R (Revelle, in prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular, and R in general, consult http://personality-project.org/r/r.guide.html: A short guide to R.

16 SessionInfo

This document was prepared using the following settings:

> sessionInfo()

R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] psych_1.6.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4           lattice_0.20-33       matrixcalc_1.0-3      MASS_7.3-45           grid_3.3.0            arm_1.8-6            
 [7] nlme_3.1-127          stats4_3.3.0          coda_0.18-1           minqa_1.2.4           mi_1.0                nloptr_1.0.4         
[13] Matrix_1.2-6          boot_1.3-18           splines_3.3.0         lme4_1.1-12           sem_3.1-7             tools_3.3.0          
[19] abind_1.4-3           GPArotation_2014.11-1 parallel_3.3.0        mnormt_1.5-4


References

Bechtoldt H (1961) An empirical study of the factor analysis stability hypothesis Psy-chometrika 26(4)405ndash432

Blashfield R K (1980) The growth of cluster analysis Tryon Ward and JohnsonMultivariate Behavioral Research 15(4)439 ndash 458

Blashfield R K and Aldenderfer M S (1988) The methods and problems of clusteranalysis In Nesselroade J R and Cattell R B editors Handbook of multivariateexperimental psychology (2nd ed) pages 447ndash473 Plenum Press New York NY

Bliese P D (2009) Multilevel Modeling in R (23) A Brief Introduction to R the multilevelpackage and the nlme package

Cattell R B (1966) The scree test for the number of factors Multivariate BehavioralResearch 1(2)245ndash276

Cattell R B (1978) The scientific use of factor analysis Plenum Press New York

Cohen J (1982) Set correlation as a general multivariate data-analytic method Multi-variate Behavioral Research 17(3)

Cohen J Cohen P West S G and Aiken L S (2003) Applied multiple regres-sioncorrelation analysis for the behavioral sciences L Erlbaum Associates MahwahNJ 3rd ed edition

Cooksey R and Soutar G (2006) Coefficient beta and hierarchical item clustering - ananalytical procedure for establishing and displaying the dimensionality and homogeneityof summated scales Organizational Research Methods 978ndash98

Cronbach L J (1951) Coefficient alpha and the internal structure of tests Psychometrika16297ndash334

Dwyer P S (1937) The determination of the factor loadings of a given test from theknown factor loadings of other tests Psychometrika 2(3)173ndash178

Everitt B (1974) Cluster analysis John Wiley amp Sons Cluster analysis 122 pp OxfordEngland

Fox J Nie Z and Byrnes J (2013) sem Structural Equation Models R package version31-3

Grice J W (2001) Computing and evaluating factor scores Psychological Methods6(4)430ndash450

Guilford J P (1954) Psychometric Methods McGraw-Hill New York 2nd edition

83

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255–282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121–132.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41–54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283–300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4):537–549.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231–258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309–317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153–175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676–1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481–495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill, New York, 3rd edition.


Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57–74.

Revelle, W. (2017). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. http://cran.r-project.org/web/packages/psych. R package version 1.7.3.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39–73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403–414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Individual differences in cognition: New methods for examining the personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27–49. Springer, New York, NY.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega and the glb: comments on Sijtsma. Psychometrika, 74(1):145–154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83–90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. A Series of books in biology. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. A Series of books in biology. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345–353.


Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components–an alternative to "mathematical factors." Psychological Review, 42(5):425–454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321–327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123–133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121–144.


1 Overview of this and related documents

To do basic and advanced personality and psychological research using R is not as complicated as some think. This is one of a set of "How To" guides for doing various things using R (R Core Team, 2017), particularly using the psych (Revelle, 2017) package.

The current list of How To's includes:

1. Installing R and some useful packages

2. Using R and the psych package to find ωh and ωt

3. Using R and the psych package for factor analysis and principal components analysis. (This document)

4. Using the scoreItems function to find scale scores and scale statistics

5. An overview (vignette) of the psych package

1.1 Jump starting the psych package–a guide for the impatient

You have installed psych (section 3) and you want to use it without reading much more. What should you do?

1. Activate the psych package:

R code
library(psych)

2. Input your data (section 4.2). If your file name ends in .sav, .text, .txt, .csv, .xpt, .rds, .Rds, .rda or .RDATA then just read it in directly using read.file. Or you can go to your friendly text editor or data manipulation program (e.g., Excel) and copy the data to the clipboard. Include a first line that has the variable labels. Paste it into psych using the read.clipboard.tab command:

R code
myData <- read.file()  # this will open a search window on your machine and read or load the file
# or first copy your file to your clipboard and then
myData <- read.clipboard.tab()  # if you have an excel file

3. Make sure that what you just read is right. Describe it (section 4.3) and perhaps look at the first and last few lines. If you want to "view" the first and last few lines using a spreadsheet-like viewer, use quickView:

R code
describe(myData)
headTail(myData)
# or
quickView(myData)

4. Look at the patterns in the data. If you have fewer than about 10 variables, look at the SPLOM (Scatter Plot Matrix) of the data using pairs.panels (section 4.4.1):

R code
pairs.panels(myData)

5. Find the correlations of all of your data.

• Descriptively (just the values) (section 4.4.2):

R code
r <- lowerCor(myData)

• Graphically (section 4.4.3):

R code
corPlot(r)

6. Test for the number of factors in your data using parallel analysis (fa.parallel, section 5.4.2) or Very Simple Structure (vss, 5.4.1):

R code
fa.parallel(myData)
vss(myData)

7. Factor analyze (see section 5.1) the data with a specified number of factors (the default is 1). The default method is minimum residual; the default rotation for more than one factor is oblimin. There are many more possibilities (see sections 5.1.1-5.1.3). Compare the solution to a hierarchical cluster analysis using the ICLUST algorithm (Revelle, 1979) (see section 5.1.6). Also consider a hierarchical factor solution to find coefficient ω (see 5.1.5):

R code
fa(myData)
iclust(myData)
omega(myData)

8. Some people like to find coefficient α as an estimate of reliability. This may be done for a single scale using the alpha function (see 6.1). Perhaps more useful is the ability to create several scales as unweighted averages of specified items using the scoreItems function (see 6.4) and to find various estimates of internal consistency for these scales, find their intercorrelations, and find scores for all the subjects:

R code
alpha(myData)  # score all of the items as part of one scale
myKeys <- make.keys(nvar=20, list(first = c(1,-3,5,-7,8:10), second = c(2,4,-6,11:15,-16)))
my.scores <- scoreItems(myKeys, myData)  # form several scales
my.scores  # show the highlights of the results

At this point you have had a chance to see the highlights of the psych package and to do some basic (and advanced) data analysis. You might find reading the entire overview vignette helpful to get a broader understanding of what can be done in R using psych. Remember that the help command (?) is available for every function. Try running the examples for each help page.

2 Overview of this and related documents

The psych package (Revelle, 2017) has been developed at Northwestern University since 2005 to include functions most useful for personality, psychometric, and psychological research. The package is also meant to supplement a text on psychometric theory (Revelle, prep), a draft of which is available at http://personality-project.org/r/book/.

Some of the functions (e.g., read.file, read.clipboard, describe, pairs.panels, scatter.hist, error.bars, multi.hist, bi.bars) are useful for basic data entry and descriptive analyses.

Psychometric applications emphasize techniques for dimension reduction, including factor analysis, cluster analysis, and principal components analysis. The fa function includes five methods of factor analysis (minimum residual, principal axis, weighted least squares, generalized least squares, and maximum likelihood factor analysis). Determining the number of factors or components to extract may be done by using the Very Simple Structure (Revelle and Rocklin, 1979) (vss), Minimum Average Partial correlation (Velicer, 1976) (MAP), or parallel analysis (fa.parallel) criteria. Item Response Theory (IRT) models for dichotomous or polytomous items may be found by factoring tetrachoric or polychoric correlation matrices and expressing the resulting parameters in terms of location and discrimination using irt.fa. Bifactor and hierarchical factor structures may be estimated by using Schmid-Leiman transformations (Schmid and Leiman, 1957) (schmid) to transform a hierarchical factor structure into a bifactor solution (Holzinger and Swineford, 1937). Scale construction can be done using the Item Cluster Analysis (Revelle, 1979) (iclust) function to determine the structure and to calculate reliability coefficients α (Cronbach, 1951) (alpha, scoreItems, score.multiple.choice), β (Revelle, 1979; Revelle and Zinbarg, 2009) (iclust), and McDonald's ωh and ωt (McDonald, 1999) (omega). Guttman's six estimates of internal consistency reliability (Guttman, 1945), as well as additional estimates (Revelle and Zinbarg, 2009), are in the guttman function. The six measures of Intraclass correlation coefficients (ICC) discussed by Shrout and Fleiss (1979) are also available.

Graphical displays include Scatter Plot Matrix (SPLOM) plots using pairs.panels, correlation "heat maps" (cor.plot), factor, cluster, and structural diagrams using fa.diagram, iclust.diagram, and structure.diagram, as well as item response characteristics and item and test information characteristic curves using plot.irt and plot.poly.

3 Getting started

Some of the functions described in this overview require other packages. Particularly useful for rotating the results of factor analyses (from e.g., fa or principal) or hierarchical factor models using omega or schmid is the GPArotation package. These and other useful packages may be installed by first installing and then using the task views (ctv) package to install the "Psychometrics" task view, but doing it this way is not necessary.
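A brief, hedged sketch of that route (it assumes an internet connection; install.views comes from the ctv package):

R code
install.packages("ctv")          # the task view installer itself
library(ctv)
install.views("Psychometrics")   # install everything in the Psychometrics task view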

4 Basic data analysis

A number of psych functions facilitate the entry of data and finding basic descriptive statistics.

Remember, to run any of the psych functions, it is necessary to make the package active by using the library command:

R code
library(psych)

The other packages, once installed, will be called automatically by psych.

It is possible to automatically load psych and other functions by creating and then saving a ".First" function, e.g.,

R code
.First <- function(x) {library(psych)}

4.1 Getting the data by using read.file

Although many find copying the data to the clipboard and then using the read.clipboard functions (see below) helpful, an alternative is to read the data in directly. This can be done using the read.file function, which calls file.choose to find the file and then, based upon the suffix of the file, chooses the appropriate way to read it. For files with suffixes of .txt, .text, .r, .rds, .rda, .csv, .xpt, or .sav, the file will be read correctly.

R code
my.data <- read.file()

If the file contains Fixed Width Format (fwf) data, the column information can be specified with the widths command:

R code
my.data <- read.file(widths = c(4, rep(1, 35)))
# will read in a file without a header row and 36 fields,
# the first of which is 4 columns wide, the rest of which are 1 column each

If the file is a RData file (with suffix of .RData, .Rda, .rda, .Rdata, or .rdata), the object will be loaded. Depending what was stored, this might be several objects. If the file is a .sav file from SPSS, it will be read with the most useful default options (converting the file to a data.frame and converting character fields to numeric). Alternative options may be specified. If it is an export file from SAS (.xpt or .XPT), it will be read. .csv files (comma separated files), normal .txt or .text files, and .data or .dat files will be read as well. These are assumed to have a header row of variable labels (header=TRUE). If the data do not have a header row, you must specify read.file(header=FALSE).

To read SPSS files and to keep the value labels, specify use.value.labels=TRUE. Unfortunately, this will make it difficult to do any analyses, and you might want to read the file again, converting the value labels to numeric values.

R code
my.spss <- read.file(use.value.labels=TRUE)  # this will keep the value labels for .sav files
my.data <- read.file()  # just read the file and convert the value labels to numeric

You can quickView the my.spss object to see what it looks like, as well as the my.data object to see the numeric encoding.

R code
quickView(my.spss)  # to see the original encoding
quickView(my.data)  # to see the numeric encoding

4.2 Data input from the clipboard

There are, of course, many ways to enter data into R. Reading from a local file using read.file is perhaps the most preferred. However, many users will enter their data in a text editor or spreadsheet program and then want to copy and paste into R. This may be done by using read.table and specifying the input file as "clipboard" (PCs) or "pipe(pbpaste)" (Macs). Alternatively, the read.clipboard set of functions is perhaps more user friendly:

read.clipboard is the base function for reading data from the clipboard.

read.clipboard.csv for reading text that is comma delimited.

read.clipboard.tab for reading text that is tab delimited (e.g., copied directly from an Excel file).

read.clipboard.lower for reading input of a lower triangular matrix with or without a diagonal. The resulting object is a square matrix.

read.clipboard.upper for reading input of an upper triangular matrix.

read.clipboard.fwf for reading in fixed width fields (some very old data sets).

For example, given a data set copied to the clipboard from a spreadsheet, just enter the command:

R code
my.data <- read.clipboard()

This will work if every data field has a value and even missing data are given some values (e.g., NA or -999). If the data were entered in a spreadsheet and the missing values were just empty cells, then the data should be read in as tab delimited or by using the read.clipboard.tab function:

R code
my.data <- read.clipboard(sep="\t")  # define the tab option, or
my.tab.data <- read.clipboard.tab()  # just use the alternative function

For the case of data in fixed width fields (some old data sets tend to have this format), copy to the clipboard and then specify the width of each field (in the example below, the first variable is 5 columns wide, the second is 2 columns, the next 5 are 1 column each, and the last 4 are 3 columns):

R code
my.data <- read.clipboard.fwf(widths = c(5, 2, rep(1, 5), rep(3, 4)))

4.3 Basic descriptive statistics

Once the data are read in, then describe will provide basic descriptive statistics arranged in a data frame format. Consider the data set sat.act, which includes data from 700 web-based participants on 3 demographic variables and 3 ability measures.

describe reports means, standard deviations, medians, min, max, range, skew, kurtosis, and standard errors for integer or real data. Non-numeric data, although the statistics are meaningless, will be treated as if numeric (based upon the categorical coding of the data) and will be flagged with an *.

It is very important to describe your data before you continue on doing more complicated multivariate statistics. The problem of outliers and bad data can not be overemphasized.

> library(psych)
> data(sat.act)
> describe(sat.act)  # basic descriptive statistics

          vars   n   mean     sd median trimmed    mad min max range  skew kurtosis   se
gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61    -1.62 0.02
education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68    -0.07 0.05
age          3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64     2.42 0.36
ACT          4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66     0.53 0.18
SATV         5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64     0.33 4.27
SATQ         6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59    -0.02 4.41

4.4 Simple descriptive graphics

Graphic descriptions of data are very helpful both for understanding the data as well as communicating important results. Scatter Plot Matrices (SPLOMS) using the pairs.panels function are useful ways to look for strange effects involving outliers and non-linearities. error.bars.by will show group means with 95% confidence boundaries.

4.4.1 Scatter Plot Matrices

Scatter Plot Matrices (SPLOMS) are very useful for describing the data. The pairs.panels function, adapted from the help menu for the pairs function, produces xy scatter plots of each pair of variables below the diagonal, shows the histogram of each variable on the diagonal, and shows the lowess locally fit regression line as well. An ellipse around the mean, with the axis length reflecting one standard deviation of the x and y variables, is also drawn. The x axis in each scatter plot represents the column variable, the y axis the row variable (Figure 1). When plotting many subjects, it is both faster and cleaner to set the plot character (pch) to be '.'. (See Figure 1 for an example.)

pairs.panels will show the pairwise scatter plots of all the variables as well as histograms, locally smoothed regressions, and the Pearson correlation. When plotting many data points (as in the case of the sat.act data), it is possible to specify that the plot character is a period to get a somewhat cleaner graphic.

4.4.2 Correlational structure

There are many ways to display correlations. Tabular displays are probably the most common. The output from the cor function in core R is a rectangular matrix. lowerMat will round this to (2) digits and then display it as a lower off diagonal matrix. lowerCor calls cor with use='pairwise', method='pearson' as default values and returns (invisibly) the full correlation matrix while displaying the lower off diagonal matrix.

> lowerCor(sat.act)

          gendr edctn age   ACT   SATV  SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64  1.00

When comparing results from two different groups, it is convenient to display them as one matrix, with the results from one group below the diagonal and the other group above the diagonal. Use lowerUpper to do this:

> female <- subset(sat.act, sat.act$gender==2)
> male <- subset(sat.act, sat.act$gender==1)
> lower <- lowerCor(male[-1])


> png('pairs.panels.png')
> pairs.panels(sat.act, pch='.')
> dev.off()
null device
          1

Figure 1: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. The plot character was set to a period (pch='.') in order to make a cleaner graph.


          edctn age   ACT   SATV  SATQ
education  1.00
age        0.61  1.00
ACT        0.16  0.15  1.00
SATV       0.02 -0.06  0.61  1.00
SATQ       0.08  0.04  0.60  0.68  1.00

> upper <- lowerCor(female[-1])

          edctn age   ACT   SATV  SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08  1.00
SATV       0.07 -0.03  0.53  1.00
SATQ       0.03 -0.09  0.58  0.63  1.00

> both <- lowerUpper(lower, upper)
> round(both, 2)

          education   age  ACT  SATV  SATQ
education        NA  0.52 0.16  0.07  0.03
age            0.61    NA 0.08 -0.03 -0.09
ACT            0.16  0.15   NA  0.53  0.58
SATV           0.02 -0.06 0.61    NA  0.63
SATQ           0.08  0.04 0.60  0.68    NA

It is also possible to compare two matrices by taking their differences and displaying one (below the diagonal) and the difference of the second from the first above the diagonal:

> diffs <- lowerUpper(lower, upper, diff=TRUE)
> round(diffs, 2)

          education   age  ACT  SATV  SATQ
education        NA  0.09 0.00 -0.05  0.05
age            0.61    NA 0.07 -0.03  0.13
ACT            0.16  0.15   NA  0.08  0.02
SATV           0.02 -0.06 0.61    NA  0.05
SATQ           0.08  0.04 0.60  0.68    NA

4.4.3 Heatmap displays of correlational structure

Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set, which has a clear 3 factor solution (Figure 2), or a simulated data set of 24 variables with a circumplex structure (Figure 3). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.


> png('corplot.png')
> cor.plot(Thurstone, numbers=TRUE, main="9 cognitive variables from Thurstone")
> dev.off()
null device
          1

Figure 2: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option, the values are displayed as well.


> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> cor.plot(r.circ, main='24 variables in a circumplex')
> dev.off()
null device
          1

Figure 3: Using the cor.plot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect.


4.5 Polychoric, tetrachoric, polyserial, and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, the φ will underestimate the value of the Pearson applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process. A simple generalization of this to the case of multiple cuts is the polychoric correlation.

Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlations.

If the data are a mix of continuous, polytomous, and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.

The correlation matrix resulting from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will also happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigen values of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
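As a minimal sketch of these functions, assuming the ability and burt data sets that are bundled with psych:

R code
library(psych)
draw.tetra(.5, 1, 1)        # show a bivariate normal with r = .5 cut at 1 sd on each variable
rt <- tetrachoric(ability)  # tetrachoric correlations of 16 dichotomous ability items
rt$rho[1:4, 1:4]            # a corner of the estimated latent correlation matrix
rs <- cor.smooth(burt)      # smooth a correlation matrix that is not positive semi-definite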

5 Item and scale analysis

The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.

5.1 Dimension reduction through factor analysis and cluster analysis

Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity¹. The goal for parsimony is seen in psychometrics as an attempt either to describe (components) or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

¹ Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.

The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs². At the most simple, a set of items can be thought to represent a random sample from one underlying domain, or perhaps a small set of domains. The question for the psychometrician is how many domains are represented, and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.

² Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.

At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components, or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores, or to estimate (for factors) factor scores.
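As a minimal base-R sketch of the eigen value route (assuming the sat.act data set from psych), the unrotated component loadings are just the eigen vectors rescaled by the square roots of the eigen values:

R code
library(psych)
R <- cor(sat.act, use = "pairwise")       # the correlation matrix
e <- eigen(R)                             # its eigen decomposition
pc <- e$vectors %*% diag(sqrt(e$values))  # rescaled eigen vectors = unrotated component loadings
round(pc[, 1:2], 2)                       # first two principal component loadings (signs are arbitrary)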

Several of the functions in psych address the problem of data reduction:

fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis, and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will eventually be phased out.

principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values.

factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the sum of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.

vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ2 estimate.

fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.

fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

nfactors A number of different tests for the number of factors problem are run.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.
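A brief sketch applying several of the functions listed above to the Thurstone correlation matrix distributed with psych (output omitted):

R code
f3 <- fa(Thurstone, 3, n.obs=213)    # minres factor analysis, oblimin rotation
p3 <- principal(Thurstone, 3)        # principal components
factor.congruence(f3, p3)            # cosines between the two sets of loadings
vss(Thurstone, n.obs=213)            # Very Simple Structure and MAP criteria
fa.parallel(Thurstone, n.obs=213)    # parallel analysis
fa.diagram(f3)                       # path diagram of the factor solution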

5.1.1 Minimum Residual Factor Analysis

The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix ${}_nR_n$ be approximated by the product of a factor matrix ${}_nF_k$ and its transpose plus a diagonal matrix of uniquenesses?

$$R = FF' + U^2 \tag{1}$$

The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls", and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.

factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.

A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 1). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ", and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation, although prior versions defaulted to varimax. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).
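For example, a hedged sketch of requesting some of these alternative transformations through the rotate parameter (output omitted):

R code
f3.v <- fa(Thurstone, 3, n.obs=213, rotate="varimax")  # an orthogonal rotation
f3.g <- fa(Thurstone, 3, n.obs=213, rotate="geominQ")  # an oblique transformation
f3.n <- fa(Thurstone, 3, n.obs=213, rotate="none")     # the unrotated solution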

5.1.2 Principal Axis Factor Analysis

An alternative least squares algorithm, factor.pa (incorporated into fa as an option, fm="pa"), does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter parameter to 1 produces the SAS solution.

The solutions from the fa, the factor.minres, and factor.pa, as well as the principal functions, can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations include varimax and quartimax. Oblique transformations include oblimin, quartimin, and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.

Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 2).

5.1.3 Weighted Least Squares Factor Analysis

Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller


Table 1: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.
> f3t <- fa(Thurstone, 3, n.obs=213)
> f3t
Factor Analysis using method = minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  MR1   MR2   MR3   h2   u2  com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.0
First.Letters    0.00  0.86  0.00 0.73 0.27 1.0
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

 With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63


Table 2: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.
> f3 <- fa(Thurstone, 3, n.obs = 213, fm="pa")
> f3o <- target.rot(f3)
> f3o
Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix
                  PA1   PA2   PA3   h2   u2
Sentences        0.89 -0.03  0.07 0.81 0.19
Vocabulary       0.89  0.07  0.00 0.80 0.20
Sent.Completion  0.83  0.04  0.03 0.70 0.30
First.Letters   -0.02  0.85 -0.01 0.73 0.27
4.Letter.Words  -0.05  0.74  0.09 0.57 0.43
Suffixes         0.17  0.63 -0.09 0.43 0.57
Letter.Series   -0.06 -0.08  0.84 0.69 0.31
Pedigrees        0.33 -0.09  0.48 0.37 0.63
Letter.Group    -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00


overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 3).

The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 4). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 4) or by a structure diagram (Figure 5).

> plot(f3t)

[Figure 4 here: a SPLOM of the item loadings on factors MR1, MR2, and MR3, each axis running from 0.0 to 0.8, with the overall title "Factor Analysis".]

Figure 4: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.

A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but


Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.
> f3w <- fa(Thurstone, 3, n.obs = 213, fm="wls")
> print(f3w, cut=0, digits=3)
Factor Analysis using method = wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix
                   WLS1   WLS2   WLS3    h2    u2  com
Sentences         0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary        0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion   0.833  0.034  0.007 0.735 0.265 1.00
First.Letters    -0.002  0.855  0.003 0.731 0.269 1.00
4.Letter.Words   -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes          0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series     0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees         0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group     -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

 With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                                WLS1  WLS2  WLS3
Correlation of scores with factors             0.964 0.923 0.902
Multiple R square of scores with factors       0.929 0.853 0.814
Minimum correlation of possible factor scores  0.858 0.706 0.627


> fa.diagram(f3t)

[Figure 5 here: a path diagram titled "Factor Analysis" in which MR1 points to Sentences (.9), Vocabulary (.9), and Sent.Completion (.8); MR2 points to First.Letters (.9), 4.Letter.Words (.7), and Suffixes (.6); MR3 points to Letter.Series (.8), Letter.Group (.6), and Pedigrees (.5); the factor intercorrelations are .6, .5, and .5.]

Figure 5: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.


that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.
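If matching a Kaiser-normalized SPSS rotation is desired, one option (a hedged sketch, using the kaiser function in psych) is:

R code
f3 <- fa(Thurstone, 3, n.obs=213, rotate="none")  # unrotated minres solution
f3.k <- kaiser(f3, rotate="varimax")              # varimax after Kaiser normalizing the loadings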

5.1.4 Principal Components analysis (PCA)

An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Psychologists typically use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 4 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. (Factor loadings and PCA loadings converge as the number of items increases; factor loadings may be thought of as the asymptotic principal component loadings as the number of items on the component grows towards infinity.) Also note how the goodness of fit statistics, based upon the residual off diagonal elements, are much worse than for the fa solution.

5.1.5 Hierarchical and bi-factor solutions

For a long time, structural analyses of the ability domain have considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).

Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 5) can themselves be factored. This results in a higher order factor model (Figure 6). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 7). Yet


Table 4: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 2. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.
> p3p <- principal(Thurstone, 3, n.obs = 213, rotate="Promax")
> p3p
Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  RC1   RC2   RC3   h2   u2  com
Sentences        0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary       0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion  0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters    0.01  0.84  0.07 0.78 0.22 1.0
4.Letter.Words  -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes         0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series    0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees        0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group    -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

 With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
 with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98


another solution is to use structural equation modeling to directly solve for the general and group factors.

> om.h <- omega(Thurstone, n.obs=213, sl=FALSE)
> op <- par(mfrow=c(1,1))

[Figure 6 here: an "Omega" diagram in which g loads on F1 (.8), F2 (.8), and F3 (.7), and the group factors point to their marker variables (Sentences, Vocabulary, Sent.Completion, First.Letters, 4.Letter.Words, Suffixes, Letter.Series, Letter.Group, and Pedigrees) with loadings between .2 and .9.]

Figure 6: A higher order factor solution to the Thurstone 9 variable problem.

Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011). Unfortunately, this rotation will frequently achieve a local rather than global optimum, and it is recommended to try multiple original configurations.
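A hedged sketch of comparing random starting configurations, using the bifactorT and Random.Start functions from GPArotation:

R code
library(GPArotation)
f3 <- fa(Thurstone, 3, n.obs=213, rotate="none")
b1 <- bifactorT(f3$loadings)                        # bifactor rotation from the default start
b2 <- bifactorT(f3$loadings, Tmat=Random.Start(3))  # try a different random start
# compare b1$loadings with b2$loadings to check whether a local optimum was reached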


> om <- omega(Thurstone, n.obs=213)

[Figure 7 here: an "Omega" diagram for the Schmid-Leiman solution in which g loads directly on all nine variables (loadings .5 to .7) and the residualized group factors F1, F2, and F3 point to their marker variables with loadings between .2 and .6.]

Figure 7: A bifactor factor solution to the Thurstone 9 variable problem.


5.1.6 Item Cluster Analysis: iclust

An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) review why the iclust algorithm is particularly appropriate for scale construction in marketing.

Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust, and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):

1. Find the proximity (e.g., correlation) matrix.

2. Identify the most similar pair of items.

3. Combine this most similar pair of items to form a new variable (cluster).

4. Find the similarity of this cluster to all other items and clusters.

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains, or in iclust if there is a failure to increase reliability coefficients α or β).

6. Purify the solution by reassigning items to the most similar cluster center.

iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori, or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 8).


Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 7). The pattern matrix is available as an object in the results.

> data(bfi)
> ic <- iclust(bfi[1:25])

[Figure 8 here: the ICLUST tree for the 25 bfi items, showing four clusters: C20 (α = .81, β = .63), which combines the extraversion, openness, and agreeableness items; C16 (α = .81, β = .76), the neuroticism items; C15 (α = .73, β = .67), the conscientiousness items; and C21 (α = .41, β = .27), the remaining openness items.]

Figure 8: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.

The previous analysis (Figure 8) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 9). Note that finding the polychoric correlations takes some time the first time, but the next three analyses were done using that correlation matrix (r.poly$rho). When using the console for input, polychoric will report on its progress while working, using


Table 5: The summary statistics from an iclust analysis shows three large clusters and a smaller cluster.
> summary(ic)  # show the results
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
       C20    C16    C15   C21
C20   0.80 -0.291   0.40 -0.33
C16  -0.24  0.815  -0.29  0.11
C15   0.30 -0.221   0.73 -0.30
C21  -0.23  0.074  -0.20  0.61


progressBar. polychoric and tetrachoric, when running on OS X, will take advantage of the multiple cores on the machine and run somewhat faster.

Table 6: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.
> data(bfi)
> r.poly <- polychoric(bfi[1:25])  # the ... indicate the progress of the function

A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than in a simple factor solution.

5.2 Confidence intervals using bootstrapping techniques

Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ2 and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 8). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.
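A brief sketch, mirroring Table 8 but with more iterations and, in the second call, a polychoric-based solution via the cor="poly" option (output omitted):

R code
f2 <- fa(bfi[1:10], 2, n.iter=100)               # 100 bootstrapped samples of the Pearson solution
f2p <- fa(bfi[1:10], 2, n.iter=100, cor="poly")  # the same, factoring polychoric correlations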

5.3 Comparing factor/component/cluster solutions

Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of the congruence coefficient, which is just the cosine of the angle between the dimensions:

$$c_{f_i f_j} = \frac{\sum_{k=1}^{n} f_{ik} f_{jk}}{\sqrt{\sum f_{ik}^2 \sum f_{jk}^2}}$$
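The coefficient is easy to compute directly; the congruence function below is an illustrative base-R sketch, not part of psych:

R code
congruence <- function(f1, f2) {
  # cosine of the angle between two loading vectors: the cross product
  # divided by the square root of the product of the sums of squares
  sum(f1 * f2) / sqrt(sum(f1^2) * sum(f2^2))
}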

Consider the case of a four factor solution and a four cluster solution to the Big Five problem:

> f4 <- fa(bfi[1:25], 4, fm="pa")
> factor.congruence(f4, ic)

      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12


> ic.poly <- iclust(r.poly$rho, title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)

[Figure 9 here: the ICLUST tree based on polychoric correlations, in which the items join a single hierarchy topped by C23 (α = .83, β = .58), with major sub-clusters C20 (α = .83, β = .66), C19 (α = .80, β = .66), C16 (α = .84, β = .79), and C15 (α = .77, β = .71).]

Figure 9: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 8), which was done using Pearson correlations.


> ic.poly <- iclust(r.poly$rho, 5, title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)

[Figure 10 here: the five cluster ICLUST solution based on polychoric correlations, with clusters C20 (α = .83, β = .66), C19 (α = .80, β = .66), C16 (α = .84, β = .79), C15 (α = .77, β = .71), and C7, plus the unclustered item O4.]

Figure 10: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 9), which was done without specifying the number of clusters, and to the next one (Figure 11), which was done by changing the beta criterion.


> ic.poly <- iclust(r.poly$rho, beta.size=3, title="ICLUST beta.size=3")

[Figure 11 here: the ICLUST solution based on polychoric correlations with beta.size=3; the tree again joins into C23 (α = .83, β = .58) and closely resembles Figure 9.]

Figure 11: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 8, 9, 10).


Table 7 The output from iclustincludes the loadings of each item on each cluster Theseare equivalent to factor structure loadings By specifying the value of cut small loadingsare suppressed The default is for cut=0sugt print(iccut=3)ICLUST (Item Cluster Analysis)Call iclust(rmat = bfi[125])

Purified AlphaC20 C16 C15 C21080 081 073 061

G6 reliabilityC20 C16 C15 C21083 100 067 038

Original BetaC20 C16 C15 C21063 076 067 027

Cluster sizeC20 C16 C15 C2110 5 5 5

Item by Cluster Structure matrixO P C20 C16 C15 C21

A1 C20 C20A2 C20 C20 059A3 C20 C20 065A4 C20 C20 043A5 C20 C20 065C1 C15 C15 054C2 C15 C15 062C3 C15 C15 054C4 C15 C15 031 -066C5 C15 C15 -030 036 -059E1 C20 C20 -050E2 C20 C20 -061 034E3 C20 C20 059 -039E4 C20 C20 066E5 C20 C20 050 040 -032N1 C16 C16 076N2 C16 C16 075N3 C16 C16 074N4 C16 C16 -034 062N5 C16 C16 055O1 C20 C21 -053O2 C21 C21 044O3 C20 C21 039 -062O4 C21 C21 -033O5 C21 C21 053

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit =  0.68   Pattern fit =  0.96   RMSR =  0.05
NULL


Table 8: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.

> fa(bfi[1:10], 2, n.iter=20)
Factor Analysis with confidence intervals using method =  fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method =  minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

 With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity =  1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are  45  and the objective function was  2.03 with Chi Square of  5664.89
The degrees of freedom for the model are 26  and the objective function was  0.17

The root mean square of the residuals (RMSR) is  0.04
The df corrected root mean square of the residuals is  0.05

The harmonic number of observations is  2762 with the empirical chi square  403.38  with prob <  2.6e-69
The total number of observations was  2800  with MLE Chi Square =  464.04  with prob <  9.2e-82

Tucker Lewis Index of factoring reliability =  0.865
RMSEA index =  0.078  and the 90 % confidence intervals are  0.071 0.084
BIC =  257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                                MR2  MR1
Correlation of scores with factors             0.86 0.88
Multiple R square of scores with factors       0.74 0.77
Minimum correlation of possible factor scores  0.49 0.54

 Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.46 -0.40 -0.36
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.06 -0.03  0.00  0.74  0.77  0.80
A4  0.10  0.15  0.18  0.41  0.44  0.48
A5 -0.02  0.02  0.07  0.58  0.62  0.66
C1  0.52  0.57  0.60 -0.08 -0.05 -0.02
C2  0.59  0.62  0.66 -0.04 -0.01  0.02
C3  0.50  0.54  0.57 -0.01  0.03  0.08
C4 -0.70 -0.66 -0.62 -0.02  0.01  0.03
C5 -0.62 -0.57 -0.53 -0.08 -0.05 -0.01

 Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.28     0.32  0.37
>


PA3 0.35 -0.24  0.88 -0.37
PA4 0.29 -0.12  0.27 -0.90

A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 9).

Table 9: Congruence coefficients for oblique factor, bifactor and component solutions for the Thurstone problem.

> factor.congruence(list(f3t, f3o, om, p3p))

     MR1  MR2  MR3  PA1   PA2  PA3    g   F1   F2   F3   h2  RC1  RC2   RC3
MR1 1.00 0.06 0.09 1.00  0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08  0.04
MR2 0.06 1.00 0.08 0.03  1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99  0.12
MR3 0.09 0.08 1.00 0.01  0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02  0.99
PA1 1.00 0.03 0.01 1.00  0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2 0.06 1.00 0.01 0.04  1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99  0.05
PA3 0.13 0.06 1.00 0.05  0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01  0.99
g   0.72 0.60 0.52 0.67  0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58  0.50
F1  1.00 0.06 0.09 1.00  0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08  0.04
F2  0.06 1.00 0.08 0.03  1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99  0.12
F3  0.09 0.08 1.00 0.01  0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02  0.99
h2  0.74 0.57 0.51 0.69  0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56  0.49
RC1 1.00 0.04 0.06 1.00  0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06  0.00
RC2 0.08 0.99 0.02 0.06  0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00  0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05  1.00

5.4 Determining the number of dimensions to extract

How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).

Techniques most commonly used include:

1) Extracting factors until the chi square of the residual matrix is not significant.

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis) fa.parallel (Horn, 1965).

4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).

5) Extracting factors as long as they are interpretable.

6) Using the Very Simple Structure Criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.

Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples (N > 1,000) all of the eigen values of random factors will be very close to 1. The scree test is quite appealing but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigator's creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.

An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest that nonetheless account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
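The effect of such minor factors is easy to simulate. A minimal sketch (assuming sim.minor returns the simulated raw scores in its observed element when n > 0):

> set.seed(42)
> sim <- sim.minor(nvar=12, nfact=3, n=500)  # 3 major factors plus minor nuisance factors
> fa.parallel(sim$observed)                  # do the extraction criteria still recover 3 factors?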

5.4.1 Very Simple Structure

The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Absolute Partial correlation) of Velicer (1976).

Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 12). However, the MAP criterion suggests that 5 is optimal.


> vss <- vss(bfi[1:25], title="Very Simple Structure of a Big 5 inventory")

[vss plot "Very Simple Structure of a Big 5 inventory": Very Simple Structure Fit (y axis, 0 to 1) as a function of the Number of Factors (x axis, 1 to 8), with separate curves for item complexities 1 through 4]

Figure 12: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.


> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximimum of 0.58  with  4  factors
VSS complexity 2 achieves a maximimum of 0.74  with  4  factors

The Velicer MAP achieves a minimum of 0.01  with  5  factors
BIC achieves a minimum of  -524.26  with  8  factors
Sample Size adjusted BIC achieves a minimum of  -117.56  with  8  factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit RMSEA  BIC   SABIC complex eChisq  SRMR eCRMS  eBIC
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.123 9648 10522.1     1.0  23881 0.119 0.125 21698
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.100 5287  6084.5     1.2  12432 0.086 0.094 10440
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.087 3200  3924.3     1.3   7232 0.066 0.075  5422
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.074 1731  2385.1     1.4   3750 0.047 0.057  2115
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.055  281   869.3     1.6   1495 0.030 0.038    27
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.043 -296   228.5     1.7    670 0.020 0.027  -639
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.037 -463     1.2     1.9    448 0.016 0.023  -711
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.032 -524  -117.6     1.9    289 0.013 0.020  -727

5.4.2 Parallel Analysis

An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 13). It is interesting to compare fa.parallel with the paran from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial minor (but not zero) factors, show that using smcs will tend to lead to too many factors.

A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly or fa.parallel with the cor option="poly". By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
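For example, a minimal sketch of such an analysis on the bfi items, using the cor="poly" option and the default 20 replications (expect this to be slow):

> fa.parallel(bfi[1:25], cor="poly", n.iter=20,
+    main="Parallel analysis (polychoric) of the bfi")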

5.5 Factor extension

Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 14).


> fa.parallel(bfi[1:25], main="Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors =  6  and the number of components =  6

[fa.parallel scree plot "Parallel Analysis of a Big 5 inventory": eigen values of principal components and factor analysis as a function of the Factor/Component Number, for actual, simulated, and resampled data]

Figure 13: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.



Another way to examine the overlap between two sets is the use of set correlation found by set.cor (discussed later).

6 Classical Test Theory and Reliability

Surprisingly, 112 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).

α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

$$\lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(V_x)}{V_x}\right) = \frac{n}{n-1}\,\frac{V_x - tr(V_x)}{V_x} = \alpha \tag{1}$$

Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. For τ equivalent items, alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy"), coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.

Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, e_j^2, and is

$$\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum (1 - r_{smc}^2)}{V_x}$$

The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.
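λ6 may be found the same way from the squared multiple correlations; a minimal sketch, continuing with the matrix R from the sketch above:

> 1 - sum(1 - smc(R))/sum(R)   # lambda 6 for standardized items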

G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.


> v16 <- sim.item(16)
> s <- c(1, 3, 5, 7, 9, 11, 13, 15)
> f2 <- fa(v16[,s], 2)
> fe <- fa.extension(cor(v16)[s, -s], f2)
> fa.diagram(f2, fe=fe)

[fa.diagram output "Factor analysis and extension": two factors (MR1, MR2) defined by the odd-numbered variables and extended to the even-numbered variables]

Figure 14: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.



Alpha, G6, and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.

More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.

Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.

Functions for examining the reliability of a single scale or a set of scales include:

alpha  Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item-whole correlations, α if an item is omitted, and item means and standard deviations.

guttman  Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zegers (1978) (μ0 ... μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).

omega  Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)

cluster.cor  Given a n x c cluster definition matrix of -1s, 0s, and 1s (the keys) and a n x n correlation matrix, find the correlations of the composite clusters.

scoreItems  Given a matrix or data.frame of k keys for m items (-1, 0, 1) and a matrix or data.frame of item scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha coefficient, G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.

score.multiple.choice  Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.

6.1 Reliability of a single scale

A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, McDonald's ωh and ωt. Consider a simulated data set representing 9 items with a hierarchical structure and the following correlation matrix. Then using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as when each item is dropped in turn.

> set.seed(17)
> r9 <- sim.hierarchical(n=500, raw=TRUE)$observed
> round(cor(r9), 2)

     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00

> alpha(r9)

Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
0.77 0.8 0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97

Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that item 2 is to be scored (incorrectly) negatively.

> keys <- c(1, -1, 1, 1, 1, 1, 1)
> alpha(attitude, keys)

Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
       0.43      0.52    0.75      0.14 1.1 0.13   58 5.5

 lower alpha upper     95% confidence boundaries
0.18 0.43 0.68

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r  S/N alpha se
rating           0.32      0.44    0.67     0.114 0.77    0.170
complaints-      0.80      0.80    0.82     0.394 3.89    0.057
privileges       0.27      0.41    0.72     0.103 0.69    0.176
learning         0.14      0.31    0.64     0.069 0.44    0.207
raises           0.14      0.27    0.61     0.059 0.38    0.205
critical         0.36      0.47    0.76     0.130 0.90    0.139
advance          0.21      0.34    0.66     0.079 0.51    0.171

 Item statistics
             n raw.r std.r r.cor r.drop mean   sd
rating      30  0.60  0.60  0.60   0.33   65 12.2
complaints- 30 -0.57 -0.58 -0.74  -0.74   50 13.3
privileges  30  0.66  0.65  0.54   0.42   53 12.2
learning    30  0.80  0.79  0.78   0.64   56 11.7
raises      30  0.82  0.83  0.85   0.69   65 10.4
critical    30  0.50  0.53  0.35   0.27   75  9.9
advance     30  0.74  0.75  0.71   0.58   43 10.3

Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if item 2 is dropped, then the reliability is improved substantially. This suggests that item 2 was incorrectly scored. Doing the analysis again with item 2 positively scored produces much more favorable results.

> keys <- c(1, 1, 1, 1, 1, 1, 1)
> alpha(attitude, keys)


Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
0.76 0.84 0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3

It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.

> items <- sim.congeneric(N=500, short=FALSE, low=-2, high=2, categorical=TRUE)  # 500 responses to 4 discrete items
> alpha(items$observed)  # item response analysis of congeneric measures

Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
0.68 0.72 0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02

Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0

6.2 Using omega to find the reliability of a single scale

Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 15) or omegaSem for a confirmatory analysis using the sem package (Fox et al., 2013) based upon the exploratory solution from omega.

McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) (http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf) compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009), http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf.)

One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.
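A minimal sketch of these steps done by hand rather than by calling omega, using the simulated r9 data from above (schmid performs the oblique factoring and the Schmid-Leiman transformation internally):

> sl <- schmid(cor(r9), nfactors=3)   # Schmid-Leiman transformation
> g <- sl$sl[,"g"]                    # loadings on the general factor
> sum(g)^2/sum(cor(r9))               # omega_h = (sum of g loadings)^2 / Vx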

ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal). For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items, it is frequently the case that some items are to be scored positively and some negatively. Although it is probably better to specify which directions the items are to be scored by specifying a key vector, if flip=TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)

β, an alternative to ω, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis, a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).

The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).

Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.

There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be $\sqrt{r_{ab}}$, where $r_{ab}$ is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. (2007). To do this in omega, add the option="first" or option="second" to the call.

Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.

In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.

McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness $u_j^2$ from factor analysis to find $e_j^2$. This is based on a decomposition of the variance of a test score, Vx, into four parts: that due to a general factor, g, that due to a set of group factors, f (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)

Letting

$$x = cg + Af + Ds + e$$

then the communality of item j, based upon general as well as group factors, is

$$h_j^2 = c_j^2 + \sum f_{ij}^2$$

and the unique variance for the item, $u_j^2 = \sigma_j^2(1 - h_j^2)$, may be used to estimate the test reliability. That is, if $h_j^2$ is the communality of item j, based upon general as well as group factors, then for standardized items, $e_j^2 = 1 - h_j^2$ and

$$\omega_t = \frac{1cc'1 + 1AA'1'}{V_x} = 1 - \frac{\sum (1 - h_j^2)}{V_x} = 1 - \frac{\sum u^2}{V_x}$$

Because $h_j^2 \geq r_{smc}^2$, $\omega_t \geq \lambda_6$.

It is important to distinguish here between the two ω coefficients of McDonald (1978) and equation 6.20a of McDonald (1999), ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor.

$$\omega_h = \frac{1cc'1}{V_x}$$

Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

$$\omega_{\inf} = \frac{1cc'1}{1cc'1 + 1AA'1'}$$
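These formulas may be evaluated directly from the Schmid-Leiman loadings found in the earlier sketch (g for the general factor, A for the group factors); note that 1cc'1 is just the squared sum of the g loadings:

> A <- sl$sl[, c("F1*","F2*","F3*")]        # group factor loadings
> Vx <- sum(cor(r9))                        # variance of the standardized composite
> (sum(g)^2 + sum(colSums(A)^2))/Vx         # omega_t
> sum(g)^2/(sum(g)^2 + sum(colSums(A)^2))   # omega_inf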

In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.

> om9

9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than  0.2
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03


> om9 <- omega(r9, title="9 simulated variables")

[omega diagram "9 simulated variables": a bifactor solution with a general factor g and three group factors F1, F2, F3]

Figure 15: A bifactor solution for 9 simulated variables with a hierarchical structure.


RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

6.3 Estimating ωh using Confirmatory Factor Analysis

The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.

> omegaSem(r9, n.obs=500)

Call: omegaSem(m = r9, n.obs = 500)
Omega 
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip, 
    digits = digits, title = title, sl = sl, labels = labels, 
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than  0.2
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03
RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

 Omega Hierarchical from a confirmatory model using sem =  0.72
 Omega Total from a confirmatory model using sem =  0.83
With loadings of
      g  F1*  F2*  F3*   h2   u2
V1 0.74 0.40           0.71 0.29
V2 0.58 0.36           0.47 0.53
V3 0.56 0.42           0.50 0.50
V4 0.57      0.45      0.53 0.47
V5 0.58      0.25      0.40 0.60
V6 0.43      0.26      0.26 0.74
V7 0.49           0.38 0.38 0.62
V8 0.44           0.45 0.39 0.61
V9 0.34           0.24 0.17 0.83

With eigenvalues of:
   g  F1*  F2*  F3*
2.60 0.47 0.40 0.33


6.3.1 Other estimates of reliability

Other estimates of reliability are found by the splitHalf function. These are described in more detail in Revelle and Zinbarg (2009). They include the 6 estimates from Guttman, four from ten Berge, and an estimate of the greatest lower bound.

> splitHalf(r9)

Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) =  0.84
Guttman lambda 6                          =  0.8
Average split half reliability            =  0.79
Guttman lambda 3 (alpha)                  =  0.8
Minimum split half reliability (beta)     =  0.72

6.4 Reliability and correlations of multiple scales within an inventory

A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness, and Openness). The data may either be the raw data or a correlation matrix (scoreItems), or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6*.

6.4.1 Scoring from raw data

To score these five scales from the 25 items, use the scoreItems function with the helper function make.keys. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), and -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items or can be specified by the user.
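For example, reverse scoring by hand on a 1 to 6 response scale (a toy illustration, not a psych function):

> x <- c(2, 5, 1)   # raw responses on a 1 to 6 scale
> 6 + 1 - x         # reverse scored: 5 2 6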

There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.

The printed output includes coefficients α and G6*, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.

First, create a keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copying it into R.) Although not normally necessary, show the keys to understand what is happening.

Note that the number of items to specify in the make.keys function is the total number of items in the inventory. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.

> keys <- make.keys(nvars=28, list(Agree=c(-1,2:5), Conscientious=c(6:8,-9,-10),
+    Extraversion=c(-11,-12,13:15), Neuroticism=c(16:20),
+    Openness = c(21,-22,23,24,-25)),
+    item.labels=colnames(bfi))
> keys

          Agree Conscientious Extraversion Neuroticism Openness
A1           -1             0            0           0        0
A2            1             0            0           0        0
A3            1             0            0           0        0
A4            1             0            0           0        0
A5            1             0            0           0        0
C1            0             1            0           0        0
C2            0             1            0           0        0
C3            0             1            0           0        0
C4            0            -1            0           0        0
C5            0            -1            0           0        0
E1            0             0           -1           0        0
E2            0             0           -1           0        0
E3            0             0            1           0        0
E4            0             0            1           0        0
E5            0             0            1           0        0
N1            0             0            0           1        0
N2            0             0            0           1        0
N3            0             0            0           1        0
N4            0             0            0           1        0
N5            0             0            0           1        0
O1            0             0            0           0        1
O2            0             0            0           0       -1
O3            0             0            0           0        1
O4            0             0            0           0        1
O5            0             0            0           0       -1
gender        0             0            0           0        0
education     0             0            0           0        0
age           0             0            0           0        0

The use of multiple key matrices for different inventories is facilitated by using the superMatrix function to combine two or more matrices. This allows convenient scoring of large data sets combining multiple inventories with keys based upon each individual inventory. Pretend for the moment that the big 5 items were made up of two inventories, one consisting of the first 10 items, the second the last 18 items (15 personality items + 3 demographic items). Then the following code would work:

> keys.1 <- make.keys(10, list(Agree=c(-1,2:5), Conscientious=c(6:8,-9,-10)))
> keys.2 <- make.keys(15, list(Extraversion=c(-1,-2,3:5), Neuroticism=c(6:10),
+    Openness = c(11,-12,13,14,-15)))
> keys.25 <- superMatrix(list(keys.1, keys.2))

The resulting keys matrix is identical to that found above, except that it does not include the extra 3 demographic items. This is useful when scoring the raw items, because the response frequencies for each category are reported, and for the demographic data.

This use of making multiple key matrices and then combining them into one super matrix of keys is particularly useful when combining demographic information with items to be scored. A set of demographic keys can be made and then these can be combined with the keys for the particular scales.

Now use these keys in combination with the raw data to score the items, calculate basic reliability and intercorrelations, and find the item-by-scale correlations for each item and each scale. By default, missing data are replaced by the median for that variable.

> scores <- scoreItems(keys, bfi)
> scores

Call: scoreItems(keys = keys, items = bfi)

(Unstandardized) Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
alpha   0.7          0.72         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
ASE   0.014         0.014        0.013       0.011    0.017

Average item correlation:
          Agree Conscientious Extraversion Neuroticism Openness
average.r  0.32          0.34         0.39        0.46     0.23

 Guttman 6* reliability:
         Agree Conscientious Extraversion Neuroticism Openness
Lambda.6   0.7          0.72         0.76        0.81      0.6

Signal/Noise based upon av.r:
             Agree Conscientious Extraversion Neuroticism Openness
Signal/Noise   2.3           2.6          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, alpha on the diagonal
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.70          0.36         0.63      -0.245     0.23
Conscientious  0.26          0.72         0.35      -0.305     0.30
Extraversion   0.46          0.26         0.76      -0.284     0.32
Neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
Openness       0.15          0.19         0.22      -0.086     0.60

 In order to see the item by scale loadings and frequency counts of the data
 print with the short option = FALSE

To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores (see Figure 16).

6.4.2 Forming scales from a correlation matrix

There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.

Consider the same bfi data set, but first find the correlations and then use cluster.cor:

> r.bfi <- cor(bfi, use="pairwise")
> scales <- cluster.cor(keys, r.bfi)
> summary(scales)

Call: cluster.cor(keys = keys, r.mat = r.bfi)

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, (standardized) alpha on the diagonal
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.71          0.35         0.64      -0.242     0.25
Conscientious  0.25          0.73         0.36      -0.287     0.30
Extraversion   0.47          0.27         0.76      -0.278     0.35
Neuroticism   -0.18         -0.22        -0.22       0.815    -0.11
Openness       0.16          0.20         0.24      -0.074     0.61

To find the correlations of the items with each of the scales (the "structure" matrix) or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.


> png("scores.png")
> pairs.panels(scores$scores, pch='.', jiggle=TRUE)
> dev.off()
quartz 
     2

Figure 16: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.


6.5 Scoring Multiple Choice Items

Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project, http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).

> data(iqitems)
> iq.keys <- c(4, 4, 4, 6,  6, 3, 4, 4,  5, 2, 2, 4,  3, 2, 6, 7)
> score.multiple.choice(iq.keys, iqitems)

Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

item statistics
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39

> # just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys, iqitems, score=FALSE)
> describe(iq.tf)  # compare to previous results

          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01

Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
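For instance, a minimal sketch (the two-scale grouping here is hypothetical, chosen only to illustrate the keys):

> ability.keys <- make.keys(16, list(verbal=1:8, spatial=9:16))  # hypothetical grouping
> ability.scores <- scoreItems(ability.keys, iq.tf)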

6.6 Item analysis

Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or based upon correlation matrices using the set.cor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.

Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 17).

7 Item Response Theory analysis

Item Response Theory is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 18).


> data(iqitems)
> iq.keys <- c(4, 4, 4, 6,  6, 3, 4, 4,  5, 2, 2, 4,  3, 2, 6, 7)
> scores <- score.multiple.choice(iq.keys, iqitems, score=TRUE, short=FALSE)
> # note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  # set this to see the output for multiple items
> irt.responses(scores$scores, iqitems[1:4], breaks=11)

[irt.responses plots for reason.4, reason.16, reason.17, and reason.19: P(response) for each alternative as a function of theta]

Figure 17: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.


7.1 Factor analysis and Item Response Theory

If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δj, and item discrimination, αj, may be found from factor analysis of the tetrachoric correlations, where λj is just the factor loading on the first factor and τj is the normal threshold reported by the tetrachoric function:

$$\delta_j = \frac{D\tau}{\sqrt{1-\lambda_j^2}} \qquad \alpha_j = \frac{\lambda_j}{\sqrt{1-\lambda_j^2}} \tag{2}$$

where D is a scaling factor used when converting to the parameterization of the logistic model, and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λj) and item thresholds (τj) are just

$$\lambda_j = \frac{\alpha_j}{\sqrt{1+\alpha_j^2}} \qquad \tau_j = \frac{\delta_j}{\sqrt{1+\alpha_j^2}}$$

Consider 9 dichotomous items representing one factor but differing in their levels of difficulty.

> set.seed(17)
> d9 <- sim.irt(9, 1000, -2.5, 2.5, mod="normal")  # dichotomous items
> test <- irt.fa(d9$items)

> test

Item Response Analysis using Factor Analysis

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor =  1
              -3    -2    -1     0     1     2     3
V1          0.48  0.59  0.24  0.06  0.01  0.00  0.00
V2          0.30  0.68  0.45  0.12  0.02  0.00  0.00
V3          0.12  0.50  0.74  0.29  0.06  0.01  0.00
V4          0.05  0.26  0.74  0.55  0.14  0.03  0.00
V5          0.01  0.07  0.44  1.05  0.41  0.06  0.01
V6          0.00  0.03  0.15  0.60  0.74  0.24  0.04
V7          0.00  0.01  0.04  0.22  0.73  0.63  0.16
V8          0.00  0.00  0.02  0.12  0.45  0.69  0.31
V9          0.00  0.01  0.02  0.08  0.25  0.47  0.36
Test Info   0.98  2.14  2.85  3.09  2.81  2.14  0.89
SEM         1.01  0.68  0.59  0.57  0.60  0.68  1.06
Reliability -0.02 0.53  0.65  0.68  0.64  0.53 -0.12


Factor analysis with  Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27  and the objective function was  1.2
The number of observations was  1000  with Chi Square =  1195.58  with prob <  7.3e-235

The root mean square of the residuals (RMSA) is  0.09
The df corrected root mean square of the residuals is  0.1

Tucker Lewis Index of factoring reliability =  0.699
RMSEA index =  0.209  and the 90 % confidence intervals are  0.198 0.218
BIC =  1009.07
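The discrimination and difficulty parameters follow from equation (2); a minimal sketch of recovering them by hand from the tetrachoric solution (normal ogive case, so D = 1):

> tet <- tetrachoric(d9$items)         # rho (correlations) and tau (thresholds)
> lambda <- fa(tet$rho)$loadings[,1]   # loadings on the single factor
> lambda/sqrt(1 - lambda^2)            # item discrimination (alpha_j)
> tet$tau/sqrt(1 - lambda^2)           # item difficulty/location (delta_j)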

Similar analyses can be done for polytomous items, such as those of the bfi extraversion scale.

> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor =  1
             -3    -2    -1     0    1    2    3
E1         0.37  0.58  0.75  0.76 0.61 0.39 0.21
E2         0.48  0.89  1.18  1.24 0.99 0.59 0.25
E3         0.34  0.49  0.58  0.57 0.49 0.36 0.23
E4         0.63  0.98  1.11  0.96 0.67 0.37 0.16
E5         0.33  0.42  0.47  0.45 0.37 0.27 0.17
Test Info  2.15  3.36  4.09  3.98 3.13 1.97 1.02
SEM        0.68  0.55  0.49  0.50 0.56 0.71 0.99
Reliability 0.53 0.70  0.76  0.75 0.68 0.49 0.02

Factor analysis with  Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5  and the objective function was  0.05
The number of observations was  2800  with Chi Square =  135.92  with prob <  1.3e-27

The root mean square of the residuals (RMSA) is  0.04
The df corrected root mean square of the residuals is  0.06

Tucker Lewis Index of factoring reliability =  0.932
RMSEA index =  0.097  and the 90 % confidence intervals are  0.083 0.111
BIC =  96.23

The item information functions show that not all of the items are equally good (Figure 19).



> op <- par(mfrow=c(3,1))
> plot(test, type="ICC")
> plot(test, type="IIC")
> plot(test, type="test")
> op <- par(mfrow=c(1,1))

[three panel plot: "Item parameters from factor analysis" (item characteristic curves), "Item information from factor analysis" (item information curves), and "Test information -- item parameters from factor analysis", each plotted against the latent trait (normal scale)]

Figure 18: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.


> e.info <- plot(e.irt, type="IIC")

[item information curves for E1-E5: "Item information from factor analysis for factor 1" plotted against the latent trait (normal scale)]

Figure 19: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.


These procedures can be generalized to more than one factor by specifying the number of factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.

> print(e.info, sort=TRUE)

Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor =  1
             -3    -2    -1     0    1    2    3
E2         0.48  0.89  1.18  1.24 0.99 0.59 0.25
E4         0.63  0.98  1.11  0.96 0.67 0.37 0.16
E1         0.37  0.58  0.75  0.76 0.61 0.39 0.21
E3         0.34  0.49  0.58  0.57 0.49 0.36 0.23
E5         0.33  0.42  0.47  0.45 0.37 0.27 0.17
Test Info  2.15  3.36  4.09  3.98 3.13 1.97 1.02
SEM        0.68  0.55  0.49  0.50 0.56 0.71 0.99
Reliability 0.53 0.70  0.76  0.75 0.68 0.49 0.02

More extensive IRT packages include ltm and eRm; these should be used for serious Item Response Theory analysis.

7.2 Speeding up analyses

Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.

For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 20) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items (Figure 21).

> iq.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = iq.tf)
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor =  1
             -3   -2   -1    0    1    2    3
reason.4   0.04 0.20 0.59 0.59 0.19 0.04 0.01
reason.16  0.07 0.23 0.46 0.39 0.16 0.05 0.01
reason.17  0.06 0.27 0.69 0.50 0.14 0.03 0.01
reason.19  0.05 0.17 0.41 0.46 0.22 0.07 0.02


> iq.irt <- irt.fa(iq.tf)

[item information curves for the 16 ability items: "Item information from factor analysis" plotted against the latent trait (normal scale)]

Figure 20: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.


letter.7    0.04 0.16 0.44 0.52 0.23 0.06 0.02
letter.33   0.04 0.14 0.34 0.43 0.24 0.08 0.02
letter.34   0.04 0.17 0.48 0.54 0.22 0.06 0.01
letter.58   0.02 0.08 0.28 0.55 0.39 0.13 0.03
matrix.45   0.05 0.11 0.23 0.29 0.21 0.10 0.04
matrix.46   0.05 0.12 0.25 0.30 0.21 0.10 0.04
matrix.47   0.05 0.16 0.37 0.41 0.21 0.07 0.02
matrix.55   0.04 0.08 0.14 0.20 0.19 0.13 0.06
rotate.3    0.00 0.01 0.07 0.29 0.66 0.45 0.13
rotate.4    0.00 0.01 0.05 0.30 0.91 0.55 0.11
rotate.6    0.01 0.03 0.14 0.49 0.68 0.27 0.06
rotate.8    0.00 0.02 0.08 0.27 0.54 0.39 0.13
Test Info   0.55 1.95 4.99 6.53 5.41 2.57 0.71
SEM         1.34 0.72 0.45 0.39 0.43 0.62 1.18
Reliability -0.80 0.49 0.80 0.85 0.82 0.61 -0.40

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104 and the objective function was 1.85
The number of observations was 1525 with Chi Square = 2801.05 with prob < 0

The root mean square of the residuals (RMSA) is 0.08
The df corrected root mean square of the residuals is 0.09

Tucker Lewis Index of factoring reliability = 0.742
RMSEA index = 0.131 and the 90% confidence intervals are 0.126 0.135
BIC = 2038.76

7.3 IRT based scoring

The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for score.irt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items/person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The score.irt function applies basic IRT to this problem.

Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects.

> v9 <- sim.irt(9, 1000, -2, 2, mod="normal")  #dichotomous items
> items <- v9$items
> test <- irt.fa(items)


> om <- omega(iq.irt$rho, 4)

[Figure 21 here: "Omega": a bifactor diagram in which the 16 ability items load on a general factor g (loadings .4 to .7) and on four group factors F1-F4.]

Figure 21: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when examining the item response functions.


> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> #now delete some of the data - note that they are ordered by score
> items[1:333, 5:9] <- NA
> items[334:666, 3:7] <- NA
> items[667:1000, 1:4] <- NA
> scores <- score.irt(test, items)
> unitweighted <- score.irt(items=items, keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord], scores, unitweighted)
> colnames(scores.df) <- c("True theta", "irt theta", "total", "fit", "rasch", "total", "fit")

These results are seen in Figure 22.

8 Multilevel modeling

Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (over all descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.

8.1 Decomposing data into within and between level correlations using statsBy

There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.

This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:

$$r_{xy} = \eta_{x_{wg}} \cdot \eta_{y_{wg}} \cdot r_{xy_{wg}} + \eta_{x_{bg}} \cdot \eta_{y_{bg}} \cdot r_{xy_{bg}} \tag{3}$$


Figure 22: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.

> pairs.panels(scores.df, pch=".", gap=0)
> title("Comparing true theta for IRT, Rasch and classically based scoring", line=3)

[Figure 22 here: "Comparing true theta for IRT, Rasch and classically based scoring": a SPLOM of the True theta, irt theta, total, fit, rasch, total, and fit scores.]


where $r_{xy}$ is the normal correlation which may be decomposed into a within group and a between group correlation, $r_{xy_{wg}}$ and $r_{xy_{bg}}$, and $\eta$ (eta) is the correlation of the data with the within group values or the group means.

8.2 Generating and displaying multilevel data

withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1 while those between groups are also set to be 1, 0, -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5 and V8, and V3, V6 and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6 and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5 and V6, and V7, V8 and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.

sim.multilevel will generate simulated data with a multilevel structure.

The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.

Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)) or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
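A minimal sketch of the first case (the choice of education as the grouping variable, and the rwg and rbg output elements, are assumptions based on statsBy's documented behavior):

R code
sb <- statsBy(sat.act, group="education")  #basic two level descriptive statistics
round(sb$rwg, 2)  #the pooled within group correlations
round(sb$rbg, 2)  #the correlations of the group means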

9 Set Correlation and Multiple Regression from the correlation matrix

An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is

$$R^2 = 1 - \prod_{i=1}^{n}(1 - \lambda_i)$$


where $\lambda_i$ is the ith eigen value of the eigen value decomposition of the matrix

$$R = R_{xx}^{-1} R_{xy} R_{yy}^{-1} R_{yx}.$$

Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case an alternative statistic, based upon the average canonical correlation, might be more appropriate.

setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.

Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example the analysis is done on the correlation matrix rather than the raw data.

> C <- cov(sat.act, use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)

Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared:  0.0272,	Adjusted R-squared:  0.02301
F-statistic: 6.487 on 3 and 696 DF,  p-value: 0.0002476

Compare this with the output from setCor.

> #compare with mat.regress
> setCor(c(4:6), c(1:3), C, n.obs=700)


Call: setCor(y = y, x = x, data = data, z = z, n.obs = n.obs, use = use,
    std = std, square = square, main = main, plot = plot)

Multiple Regression from matrix input

Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19
multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11
Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

SE of Beta weights
            ACT SATV SATQ
gender     0.18 4.29 4.34
education  0.22 5.13 5.18
age        0.22 5.11 5.16

t of Beta Weights
             ACT  SATV  SATQ
gender     -0.27 -0.01 -0.04
education   0.65  0.02  0.02
age         0.15 -0.02 -0.02

Probability of t <
            ACT SATV SATQ
gender     0.79 0.99 0.97
education  0.51 0.98 0.98
age        0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05


degrees of freedom of regression
[1]   3 696

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.050 0.033 0.008
Chisq of canonical correlations
[1] 35.8 23.1  5.6

Average squared canonical correlation = 0.03
Cohens Set Correlation R2 = 0.09
Shrunken Set Correlation R2 = 0.08
F and df of Cohens Set Correlation  7.26  9  1681.86
Unweighted correlation between the two sets = 0.01

Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same independent of the direction of the relationship.
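This symmetry may be checked by reversing the roles of the two sets in the earlier call (a sketch reusing the covariance matrix C defined above); the reported Set Correlation R2 should again be 0.09:

R code
setCor(c(1:3), c(4:6), C, n.obs=700)  #now predict the demographics from the test scores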

10 Simulation functions

It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.

A number of functions in the psych package will generate simulated data. These functions include sim, for a factor simplex, and sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated Rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.

sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.minor Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.

sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure (see the sketch following this list).

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.
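As one concrete example (a sketch assuming sim.congeneric's default arguments), a congeneric population correlation matrix can be generated and then recovered with a one factor solution:

R code
R.con <- sim.congeneric(loads=c(.8, .7, .6, .5))  #the population correlation matrix
fa(R.con)  #a one factor solution should recover loadings near .8, .7, .6, .5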

Some of these functions are described in more detail in the companion vignette: psych for sem.

The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.

Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and rⁿ for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.
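Such a structure is easy to construct by hand (plain R, not one of the sim functions), which makes the rⁿ pattern explicit:

R code
r <- .8
R.simplex <- r^abs(outer(1:6, 1:6, "-"))  #r for adjacent variables, r^n for variables n apart
round(R.simplex, 2)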

Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.

Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.


The four IRT simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.
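A sketch of the two forms of the call, following the sim.irt call used earlier in the scoring example (9 items, difficulties from -2 to 2):

R code
v.logistic <- sim.irt(9, 500, -2, 2, mod="logistic")  #calls sim.npl
v.normal   <- sim.irt(9, 500, -2, 2, mod="normal")    #calls sim.npn
dim(v.logistic$items)  #500 subjects by 9 dichotomous items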

The logistic model is

$$P(x | \theta_i, \delta_j, \gamma_j, \zeta_j) = \gamma_j + \frac{\zeta_j - \gamma_j}{1 + e^{\alpha_j(\delta_j - \theta_i)}} \tag{4}$$

where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), α_j is item discrimination and δ_j is item difficulty. For the 1 Parameter Logistic (Rasch) model, gamma = 0, zeta = 1, alpha = 1, and item difficulty is the only free parameter to specify.

(Graphics of these may be seen in the demonstrations for the logistic function.)

The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same. With the a = α parameter = 1.702 in the logistic model, the two models are practically identical.
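Equation 4 is simple enough to code directly; the following hypothetical helper (irt.prob is not a psych function) sketches both variants and the near identity when alpha = 1.702 in the logistic form:

R code
irt.prob <- function(theta, delta, alpha=1, gamma=0, zeta=1, logistic=TRUE) {
  core <- if (logistic) {
    1/(1 + exp(alpha * (delta - theta)))   #the logistic kernel
  } else {
    pnorm(alpha * (theta - delta))         #the normal (ogive) kernel
  }
  gamma + (zeta - gamma) * core            #Equation 4
}
theta <- seq(-3, 3, .5)
#the two models differ by less than .01 everywhere on this range
round(irt.prob(theta, delta=0, alpha=1.702) - irt.prob(theta, delta=0, logistic=FALSE), 3)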

11 Graphical Displays

Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as specific diagram functions, e.g. (but not shown):

f3 <- fa(Thurstone, 3)
plot(f3)
fa.diagram(f3)
c <- iclust(Thurstone)
plot(c)            #a pretty boring plot
iclust.diagram(c)  #a better diagram
c3 <- iclust(Thurstone, 3)
plot(c3)           #a more interesting plot
data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)
ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)

The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in the dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.


> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA, xlim=xlim, ylim=ylim, main="Demonstration of dia functions", axes=FALSE, xlab="", ylab="")
> ul <- dia.rect(1, 9, labels="upper left", xlim=xlim, ylim=ylim)
> ll <- dia.rect(1, 3, labels="lower left", xlim=xlim, ylim=ylim)
> lr <- dia.ellipse(9, 3, "lower right", xlim=xlim, ylim=ylim)
> ur <- dia.ellipse(9, 9, "upper right", xlim=xlim, ylim=ylim)
> ml <- dia.ellipse(3, 6, "middle left", xlim=xlim, ylim=ylim)
> mr <- dia.ellipse(7, 6, "middle right", xlim=xlim, ylim=ylim)
> bl <- dia.ellipse(1, 1, "bottom left", xlim=xlim, ylim=ylim)
> br <- dia.rect(9, 1, "bottom right", xlim=xlim, ylim=ylim)
> dia.arrow(from=lr, to=ul, labels="right to left")
> dia.arrow(from=ul, to=ur, labels="left to right")
> dia.curved.arrow(from=lr, to=ll$right, labels="right to left")
> dia.curved.arrow(to=ur, from=ul$right, labels="left to right")
> dia.curve(ll$top, ul$bottom, "double")  #for rectangles, specify where to point
> dia.curved.arrow(mr, ur, "up")          #but for ellipses, just point to it
> dia.curve(ml, mr, "across")
> dia.arrow(ur, lr, "top down")
> dia.curved.arrow(br$top, lr$bottom, "up")
> dia.curved.arrow(bl, br, "left to right")
> dia.arrow(bl, ll$bottom)
> dia.curved.arrow(ml, ll$right)
> dia.curved.arrow(mr, lr$top)

[Figure 23 here: "Demonstration of dia functions": labeled rectangles and ellipses connected by straight and curved arrows.]

Figure 23: The basic graphic capabilities of the dia functions are shown in this figure.

12 Miscellaneous functions

A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.

block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score.

geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headtail. Combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r and estimates effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also setCor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.

superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys or when forming example problems.
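Quick checks of a few of these helpers (the values in the comments are what the calls should return):

R code
fisherz(.5)                    #0.55, the Fisher z corresponding to r = .5
geometric.mean(c(1, 2, 4, 8))  #2.83, the appropriate mean for ratio data
headtail(sat.act)              #the first and last four rows of the data set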

13 Data sets

A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as McDonald (1999). The original data are from Thurstone and Thurstone (1941) and reanalyzed by Bechtoldt (1961). Personality item data represent five personality factors on 25 items (bfi) or 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.

Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.


iqitems 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton heights. peas is the data set Francis Galton used to introduce the correlation coefficient, with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1994), this data set allows for examples of basic scaling techniques.

14 Development version and a user's guide

The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered ones). To download that version, go to the repository http://personality-project.org/r/src/contrib and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.
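For example (a sketch; the repository address follows the description above), the development version may be installed from within R by pointing install.packages at that repository:

R code
install.packages("psych", repos="http://personality-project.org/r", type="source")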

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf

News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news:

> news(Version > "1.2.8", package="psych")


15 Psychometric Theory

The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book): An introduction to Psychometric Theory with Applications in R (Revelle, in prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular, and R in general, consult http://personality-project.org/r/r.guide.html: A short guide to R.

16 SessionInfo

This document was prepared using the following settings:

> sessionInfo()

R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] psych_1.6.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4           lattice_0.20-33  matrixcalc_1.0-3  MASS_7.3-45   grid_3.3.0   arm_1.8-6
 [7] nlme_3.1-127          stats4_3.3.0     coda_0.18-1       minqa_1.2.4   mi_1.0       nloptr_1.0.4
[13] Matrix_1.2-6          boot_1.3-18      splines_3.3.0     lme4_1.1-12   sem_3.1-7    tools_3.3.0
[19] abind_1.4-3           GPArotation_2014.11-1  parallel_3.3.0  mnormt_1.5-4


References

Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405-432.
Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439-458.
Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447-473. Plenum Press, New York, NY.
Bliese, P. D. (2009). Multilevel Modeling in R (2.3): A Brief Introduction to R, the multilevel package and the nlme package.
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245-276.
Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.
Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).
Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd edition.
Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78-98.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297-334.
Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173-178.
Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England. 122 pp.
Fox, J., Nie, Z., and Byrnes, J. (2013). sem: Structural Equation Models. R package version 3.1-3.
Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430-450.
Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.
Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255-282.
Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.
Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121-132.
Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41-54.
Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179-185.
Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283-300.
Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4):537-549.
Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231-258.
Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309-317.
MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153-175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.
Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676-1686.
McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.
Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481-495.
Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.
Nunnally, J. C. and Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill, New York, 3rd edition.
Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.
R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Revelle, W. (1979). Hierarchical cluster analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57-74.
Revelle, W. (2017). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. http://cran.r-project.org/web/packages/psych. R package version 1.7.3.
Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.
Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39-73. Wiley-Blackwell.
Revelle, W. and Rocklin, T. (1979). Very Simple Structure: alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403-414.
Revelle, W., Wilt, J., and Rosenthal, A. (2010). Individual differences in cognition: New methods for examining the personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27-49. Springer, New York, NY.
Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega, and the glb: comments on Sijtsma. Psychometrika, 74(1):145-154.
Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83-90.
Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420-428.
Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. A Series of Books in Biology. W. H. Freeman, San Francisco.
Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. A Series of Books in Biology. W. H. Freeman, San Francisco.
Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72-101.
Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345-353.
Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.
Tryon, R. C. (1935). A theory of psychological components - an alternative to "mathematical factors." Psychological Review, 42(5):425-454.
Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.
Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321-327.
Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123-133.
Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121-144.



or quickView(myData)

4. Look at the patterns in the data. If you have fewer than about 10 variables, look at the SPLOM (Scatter Plot Matrix) of the data using pairs.panels (section 4.4.1).

R code
pairs.panels(myData)

5. Find the correlations of all of your data.

• Descriptively (just the values) (section 4.4.2)
R code
lowerCor(myData)

• Graphically (section 4.4.3)
R code
corPlot(r)

6. Test for the number of factors in your data using parallel analysis (fa.parallel, section 5.4.2) or Very Simple Structure (vss, 5.4.1).

R code
fa.parallel(myData)
vss(myData)

7. Factor analyze (see section 5.1) the data with a specified number of factors (the default is 1); the default method is minimum residual, and the default rotation for more than one factor is oblimin. There are many more possibilities (see sections 5.1.1-5.1.3). Compare the solution to a hierarchical cluster analysis using the ICLUST algorithm (Revelle, 1979) (see section 5.1.6). Also consider a hierarchical factor solution to find coefficient ω (see 5.1.5).

R code
fa(myData)
iclust(myData)
omega(myData)

8. Some people like to find coefficient α as an estimate of reliability. This may be done for a single scale using the alpha function (see 6.1). Perhaps more useful is the ability to create several scales as unweighted averages of specified items using the scoreItems function (see 6.4) and to find various estimates of internal consistency for these scales, find their intercorrelations, and find scores for all the subjects.

R code
alpha(myData)  #score all of the items as part of one scale
myKeys <- make.keys(nvar=20, list(first = c(1,-3,5,-7,8:10), second = c(2,4,-6,11:15,-16)))
my.scores <- scoreItems(myKeys, myData)  #form several scales
my.scores  #show the highlights of the results


At this point you have had a chance to see the highlights of the psych package and to do some basic (and advanced) data analysis. You might find reading the entire overview vignette helpful to get a broader understanding of what can be done in R using the psych package. Remember that the help command (?) is available for every function. Try running the examples for each help page.


2 Overview of this and related documents

The psych package (Revelle, 2017) has been developed at Northwestern University since 2005 to include functions most useful for personality, psychometric, and psychological research. The package is also meant to supplement a text on psychometric theory (Revelle, in prep), a draft of which is available at http://personality-project.org/r/book/.

Some of the functions (e.g., read.file, read.clipboard, describe, pairs.panels, scatter.hist, error.bars, multi.hist, bi.bars) are useful for basic data entry and descriptive analyses.

Psychometric applications emphasize techniques for dimension reduction including factor analysis, cluster analysis, and principal components analysis. The fa function includes five methods of factor analysis (minimum residual, principal axis, weighted least squares, generalized least squares and maximum likelihood factor analysis). Determining the number of factors or components to extract may be done by using the Very Simple Structure (Revelle and Rocklin, 1979) (vss), Minimum Average Partial correlation (Velicer, 1976) (MAP) or parallel analysis (fa.parallel) criteria. Item Response Theory (IRT) models for dichotomous or polytomous items may be found by factoring tetrachoric or polychoric correlation matrices and expressing the resulting parameters in terms of location and discrimination using irt.fa. Bifactor and hierarchical factor structures may be estimated by using Schmid Leiman transformations (Schmid and Leiman, 1957) (schmid) to transform a hierarchical factor structure into a bifactor solution (Holzinger and Swineford, 1937). Scale construction can be done using the Item Cluster Analysis (Revelle, 1979) (iclust) function to determine the structure and to calculate reliability coefficients α (Cronbach, 1951) (alpha, scoreItems, score.multiple.choice), β (Revelle, 1979; Revelle and Zinbarg, 2009) (iclust) and McDonald's ωh and ωt (McDonald, 1999) (omega). Guttman's six estimates of internal consistency reliability (Guttman, 1945), as well as additional estimates (Revelle and Zinbarg, 2009), are in the guttman function. The six measures of Intraclass correlation coefficients (ICC) discussed by Shrout and Fleiss (1979) are also available.

Graphical displays include Scatter Plot Matrix (SPLOM) plots using pairs.panels, correlation "heat maps" (cor.plot), factor, cluster, and structural diagrams using fa.diagram, iclust.diagram, and structure.diagram, as well as item response characteristics and item and test information characteristic curves, plot.irt and plot.poly.

3 Getting started

Some of the functions described in this overview require other packages. Particularly useful for rotating the results of factor analyses (from, e.g., fa or principal) or hierarchical factor models using omega or schmid is the GPArotation package. These and other useful


packages may be installed by first installing and then using the task views (ctv) package to install the "Psychometrics" task view, but doing it this way is not necessary.

4 Basic data analysis

A number of psych functions facilitate the entry of data and finding basic descriptive statistics.

Remember, to run any of the psych functions, it is necessary to make the package active by using the library command:

R code
library(psych)

The other packages, once installed, will be called automatically by psych.

It is possible to automatically load psych and other functions by creating and then saving a ".First" function, e.g.,

R code
.First <- function(x) {library(psych)}

4.1 Getting the data by using read.file

Although many find copying the data to the clipboard and then using the read.clipboard functions (see below) helpful, an alternative is to read the data in directly. This can be done using the read.file function, which calls file.choose to find the file and then, based upon the suffix of the file, chooses the appropriate way to read it. For files with suffixes of .txt, .text, .r, .rds, .rda, .csv, .xpt, or .sav the file will be read correctly.

R code
my.data <- read.file()

If the file contains Fixed Width Format (fwf) data, the column information can be specified with the widths command.

R code
my.data <- read.file(widths = c(4, rep(1, 35)))  #will read in a file without a header row and 36 fields, the first of which is 4 columns, the rest of which are 1 column each

If the file is an RData file (with suffix of .RData, .Rda, .rda, .Rdata, or .rdata) the object will be loaded. Depending what was stored, this might be several objects. If the file is a .sav file from SPSS, it will be read with the most useful default options (converting the file to a data.frame and converting character fields to numeric). Alternative options may be specified. If it is an export file from SAS (.xpt or .XPT) it will be read. .csv files (comma separated files), normal .txt or .text files, .data or .dat files will be read as well. These are


assumed to have a header row of variable labels (header=TRUE). If the data do not have a header row, you must specify read.file(header=FALSE).

To read SPSS files and to keep the value labels, specify use.value.labels=TRUE. Unfortunately, this will make it difficult to do any analyses, and you might want to read it again, converting the value labels to numeric values.

R code
my.spss <- read.file(use.value.labels=TRUE)  #this will keep the value labels for sav files
my.data <- read.file()  #just read the file and convert the value labels to numeric

You can quickView the my.spss object to see what it looks like, as well as my.data to see the numeric encoding.

R code
quickView(my.spss)  #to see the original encoding
quickView(my.data)  #to see the numeric encoding

4.2 Data input from the clipboard

There are, of course, many ways to enter data into R. Reading from a local file using read.file is perhaps the most preferred. However, many users will enter their data in a text editor or spreadsheet program and then want to copy and paste into R. This may be done by using read.table and specifying the input file as "clipboard" (PCs) or "pipe(pbpaste)" (Macs). Alternatively, the read.clipboard set of functions are perhaps more user friendly:

read.clipboard is the base function for reading data from the clipboard.

read.clipboard.csv for reading text that is comma delimited.

read.clipboard.tab for reading text that is tab delimited (e.g., copied directly from an Excel file).

read.clipboard.lower for reading input of a lower triangular matrix with or without a diagonal. The resulting object is a square matrix.

read.clipboard.upper for reading input of an upper triangular matrix.

read.clipboard.fwf for reading in fixed width fields (some very old data sets).

For example, given a data set copied to the clipboard from a spreadsheet, just enter the command:

R code
my.data <- read.clipboard()

This will work if every data field has a value and even missing data are given some values (e.g., NA or -999). If the data were entered in a spreadsheet and the missing values


were just empty cells, then the data should be read in as tab delimited or by using the read.clipboard.tab function.

R code
my.data <- read.clipboard(sep="\t")    #define the tab option, or
my.tab.data <- read.clipboard.tab()    #just use the alternative function

For the case of data in fixed width fields (some old data sets tend to have this format), copy to the clipboard and then specify the width of each field (in the example below, the first variable is 5 columns, the second is 2 columns, the next 5 are 1 column, the last 4 are 3 columns).

R code
my.data <- read.clipboard.fwf(widths=c(5, 2, rep(1, 5), rep(3, 4)))

4.3 Basic descriptive statistics

Once the data are read in, then describe will provide basic descriptive statistics arranged in a data frame format. Consider the data set sat.act, which includes data from 700 web based participants on 3 demographic variables and 3 ability measures.

describe reports means, standard deviations, medians, min, max, range, skew, kurtosis, and standard errors for integer or real data. Non-numeric data, although the statistics are meaningless, will be treated as if numeric (based upon the categorical coding of the data) and will be flagged with an *.

It is very important to describe your data before you continue on doing more complicated multivariate statistics. The problem of outliers and bad data can not be overemphasized.

> library(psych)
> data(sat.act)
> describe(sat.act)  #basic descriptive statistics

          vars   n   mean     sd median trimmed    mad min max range  skew kurtosis   se
gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61    -1.62 0.02
education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68    -0.07 0.05
age          3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64     2.42 0.36
ACT          4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66     0.53 0.18
SATV         5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64     0.33 4.27
SATQ         6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59    -0.02 4.41

4.4 Simple descriptive graphics

Graphic descriptions of data are very helpful both for understanding the data as well as communicating important results. Scatter Plot Matrices (SPLOMS) using the pairs.panels


function are useful ways to look for strange effects involving outliers and non-linearities. error.bars.by will show group means with 95% confidence boundaries.

4.4.1 Scatter Plot Matrices

Scatter Plot Matrices (SPLOMS) are very useful for describing the data. The pairs.panels function, adapted from the help menu for the pairs function, produces xy scatter plots of each pair of variables below the diagonal, shows the histogram of each variable on the diagonal, and shows the lowess locally fit regression line as well. An ellipse around the mean, with the axis length reflecting one standard deviation of the x and y variables, is also drawn. The x axis in each scatter plot represents the column variable, the y axis the row variable (Figure 1). When plotting many subjects, it is both faster and cleaner to set the plot character (pch) to be '.' (See Figure 1 for an example.)

pairs.panels will show the pairwise scatter plots of all the variables as well as histograms, locally smoothed regressions, and the Pearson correlation. When plotting many data points (as in the case of the sat.act data), it is possible to specify that the plot character is a period to get a somewhat cleaner graphic.

4.4.2 Correlational structure

There are many ways to display correlations. Tabular displays are probably the most common. The output from the cor function in core R is a rectangular matrix. lowerMat will round this to (2) digits and then display as a lower off diagonal matrix. lowerCor calls cor with use='pairwise', method='pearson' as default values and returns (invisibly) the full correlation matrix and displays the lower off diagonal matrix.

> lowerCor(sat.act)

          gendr edctn age   ACT   SATV  SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64  1.00

When comparing results from two different groups, it is convenient to display them as one matrix, with the results from one group below the diagonal and the other group above the diagonal. Use lowerUpper to do this:

> female <- subset(sat.act, sat.act$gender==2)
> male <- subset(sat.act, sat.act$gender==1)
> lower <- lowerCor(male[-1])


> png('pairspanels.png')
> pairs.panels(sat.act, pch='.')
> dev.off()
null device
          1

Figure 1: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. The plot character was set to a period (pch='.') in order to make a cleaner graph.


          edctn age   ACT   SATV SATQ
education  1.00
age        0.61  1.00
ACT        0.16  0.15  1.00
SATV       0.02 -0.06  0.61  1.00
SATQ       0.08  0.04  0.60  0.68 1.00

> upper <- lowerCor(female[-1])

          edctn age   ACT   SATV SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08  1.00
SATV       0.07 -0.03  0.53  1.00
SATQ       0.03 -0.09  0.58  0.63 1.00

> both <- lowerUpper(lower, upper)
> round(both, 2)

          education   age  ACT  SATV  SATQ
education        NA  0.52 0.16  0.07  0.03
age            0.61    NA 0.08 -0.03 -0.09
ACT            0.16  0.15   NA  0.53  0.58
SATV           0.02 -0.06 0.61    NA  0.63
SATQ           0.08  0.04 0.60  0.68    NA

It is also possible to compare two matrices by taking their differences and displaying one (below the diagonal) and the difference of the second from the first above the diagonal:

> diffs <- lowerUpper(lower, upper, diff=TRUE)
> round(diffs, 2)

          education   age  ACT  SATV SATQ
education        NA  0.09 0.00 -0.05 0.05
age            0.61    NA 0.07 -0.03 0.13
ACT            0.16  0.15   NA  0.08 0.02
SATV           0.02 -0.06 0.61    NA 0.05
SATQ           0.08  0.04 0.60  0.68   NA

4.4.3 Heatmap displays of correlational structure

Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set, which has a clear 3 factor solution (Figure 2), or a simulated data set of 24 variables with a circumplex structure (Figure 3). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.


> png('corplot.png')
> cor.plot(Thurstone, numbers=TRUE, main="9 cognitive variables from Thurstone")
> dev.off()
null device
          1

Figure 2: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option, the values are displayed as well.


> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> cor.plot(r.circ, main="24 variables in a circumplex")
> dev.off()
null device
          1

Figure 3: Using the cor.plot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect.


4.5 Polychoric, tetrachoric, polyserial, and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, the φ will underestimate the value of the Pearson applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process. A simple generalization of this to the case of multiple cuts is the polychoric correlation.

Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlations.

If the data are a mix of continuous, polytomous, and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.

The correlation matrix resulting from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will also happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigen values of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
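A minimal sketch of the smoothing step using that data set:

R code
data(burt)
min(eigen(burt)$values)            #slightly negative: the matrix is not positive semi-definite
burt.smoothed <- cor.smooth(burt)  #adjust the offending eigen values and rescale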

5 Item and scale analysis

The main functions in the psych package are for analyzing the structure of items and of scales, and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.

5.1 Dimension reduction through factor analysis and cluster analysis

Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity¹. The goal for parsimony is seen in psychometrics as an attempt either to describe (components)

¹Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.


or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs². At the most simple, a set of items can be thought to represent a random sample from one underlying domain, or perhaps a small set of domains. The question for the psychometrician is how many domains are represented, and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.

At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components, or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores, or to estimate (for factors) factor scores.

Several of the functions in psych address the problem of data reduction:

fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis, and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will be eventually phased out.

principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values.

factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the sum of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.

vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ2 estimate.

²Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.

fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.

fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

nfactors A number of different tests for the number of factors problem are run

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.

5.1.1 Minimum Residual Factor Analysis

The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix $_nR_n$ be approximated by the product of a factor matrix $_nF_k$ and its transpose plus a diagonal matrix of uniquenesses?

\[ R = FF' + U^2 \tag{1} \]

The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls" and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.

factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.

A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 1). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", and "geominQ", and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation, although prior versions defaulted to varimax. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).
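Choosing one of the alternative transformations is just a matter of changing the rotate argument in the Table 1 call, e.g. (a sketch):

> f3v <- fa(Thurstone, 3, n.obs = 213, rotate = "varimax")  # an orthogonal alternative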

5.1.2 Principal Axis Factor Analysis

An alternative least squares algorithm, factor.pa (incorporated into fa as an option, fm="pa"), does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter parameter to 1 produces the SAS solution.
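A sketch of that single-iteration call (assuming fa's max.iter argument controls the number of iterations):

> f3.sas <- fa(Thurstone, 3, n.obs = 213, fm = "pa", max.iter = 1)  # matches the SAS PA solution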

The solutions from the fa, the factor.minres and factor.pa as well as the principal functions can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations are varimax and quartimax. Oblique transformations include oblimin, quartimin and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.

Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 2).

5.1.3 Weighted Least Squares Factor Analysis

Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller


Table 1: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.

> f3t <- fa(Thurstone, 3, n.obs = 213)
> f3t
Factor Analysis using method =  minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  MR1   MR2   MR3   h2   u2 com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.0
First.Letters    0.00  0.86  0.00 0.73 0.27 1.0
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

 With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63


Table 2: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.

> f3 <- fa(Thurstone, 3, n.obs = 213, fm = "pa")
> f3o <- target.rot(f3)
> f3o
Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix
                  PA1   PA2   PA3   h2   u2
Sentences        0.89 -0.03  0.07 0.81 0.19
Vocabulary       0.89  0.07  0.00 0.80 0.20
Sent.Completion  0.83  0.04  0.03 0.70 0.30
First.Letters   -0.02  0.85 -0.01 0.73 0.27
4.Letter.Words  -0.05  0.74  0.09 0.57 0.43
Suffixes         0.17  0.63 -0.09 0.43 0.57
Letter.Series   -0.06 -0.08  0.84 0.69 0.31
Pedigrees        0.33 -0.09  0.48 0.37 0.63
Letter.Group    -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00


overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 3).

The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 4). Factors were transformed obliquely using an oblimin. These solutions may be shown as item by factor plots (Figure 4) or by a structure diagram (Figure 5).

> plot(f3t)

[SPLOM of the factor loadings, titled "Factor Analysis": panels MR1, MR2 and MR3, with axes running from 0.0 to 0.8]

Figure 4: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.

A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but


Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.

> f3w <- fa(Thurstone, 3, n.obs = 213, fm = "wls")
> print(f3w, cut = 0, digits = 3)
Factor Analysis using method =  wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix
                   WLS1   WLS2   WLS3    h2    u2  com
Sentences         0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary        0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion   0.833  0.034  0.007 0.735 0.265 1.00
First.Letters    -0.002  0.855  0.003 0.731 0.269 1.00
4.Letter.Words   -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes          0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series     0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees         0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group     -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

 With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627


> fa.diagram(f3t)

[Path diagram titled "Factor Analysis": MR1 → Sentences (0.9), Vocabulary (0.9), Sent.Completion (0.8); MR2 → First.Letters (0.9), 4.Letter.Words (0.7), Suffixes (0.6); MR3 → Letter.Series (0.8), Letter.Group (0.6), Pedigrees (0.5); factor intercorrelations 0.5 to 0.6]

Figure 5: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.


that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.

5.1.4 Principal Components analysis (PCA)

An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Psychologists typically use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 4 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. (Factor loadings and PCA loadings converge as the number of items increases: factor loadings may be thought of as the asymptotic principal component loadings as the number of items on the component grows towards infinity.) Also note how the goodness of fit statistics, based upon the residual off diagonal elements, is much worse than the fa solution.

5.1.5 Hierarchical and bi-factor solutions

For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).

Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 5) can themselves be factored. This results in a higher order factor model (Figure 6). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 7). Yet


Table 4: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 2. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.

> p3p <- principal(Thurstone, 3, n.obs = 213, rotate = "Promax")
> p3p
Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  RC1   RC2   RC3   h2   u2 com
Sentences        0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary       0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion  0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters    0.01  0.84  0.07 0.78 0.22 1.0
4.Letter.Words  -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes         0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series    0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees        0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group    -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

 With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
 with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98


another solution is to use structural equation modeling to directly solve for the general and group factors.

> om.h <- omega(Thurstone, n.obs = 213, sl = FALSE)
> op <- par(mfrow = c(1, 1))

[Higher order path diagram titled "Omega": the nine variables load 0.2 to 0.9 on the group factors F1, F2 and F3; g loads 0.8, 0.8 and 0.7 on F1, F2 and F3]

Figure 6: A higher order factor solution to the Thurstone 9 variable problem.

Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011). Unfortunately, this rotation will frequently achieve a local rather than global optimum, and it is recommended to try multiple original configurations.
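A sketch of such a rotation, assuming "bifactor" is among the rotation options that fa passes to GPArotation; given the local-optimum problem just noted, the result should be checked against runs from other starting configurations:

> f.bi <- fa(Thurstone, 3, n.obs = 213, rotate = "bifactor")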


> om <- omega(Thurstone, n.obs = 213)

[Bifactor path diagram titled "Omega": g loads 0.5 to 0.7 on all nine variables; the residualized group factors F1, F2 and F3 load 0.2 to 0.6]

Figure 7: A bifactor factor solution to the Thurstone 9 variable problem.


5.1.6 Item Cluster Analysis: iclust

An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) review why the iclust algorithm is particularly appropriate for scale construction in marketing.

Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):

1. Find the proximity (e.g. correlation) matrix,

2. Identify the most similar pair of items,

3. Combine this most similar pair of items to form a new variable (cluster),

4. Find the similarity of this cluster to all other items and clusters,

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains, or in iclust if there is a failure to increase reliability coefficients α or β),

6. Purify the solution by reassigning items to the most similar cluster center.

iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori, or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 8).
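A sketch of specifying the number of clusters a priori (assuming the nclusters argument of iclust):

> ic4 <- iclust(bfi[1:25], nclusters = 4)  # stop at four clusters rather than using the β criterion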


Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 7). The pattern matrix is available as an object in the results.

> data(bfi)
> ic <- iclust(bfi[1:25])

[ICLUST dendrogram of the 25 bfi items: the final clusters are C20 (α = 0.81, β = 0.63), C16 (α = 0.81, β = 0.76), C15 (α = 0.73, β = 0.67) and C21 (α = 0.41, β = 0.27)]

Figure 8: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.

The previous analysis (Figure 8) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 9). Note that finding the polychoric correlations the first time takes some time, but the next three analyses were done using that correlation matrix (rpoly$rho). When using the console for input, polychoric will report on its progress while working using


Table 5: The summary statistics from an iclust analysis show three large clusters and a smaller cluster.

> summary(ic)  # show the results
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20    C16    C15   C21
C20  0.80 -0.291  0.40 -0.33
C16 -0.24  0.815 -0.29  0.11
C15  0.30 -0.221  0.73 -0.30
C21 -0.23  0.074 -0.20  0.61


progressBar. polychoric and tetrachoric, when running on OSX, will take advantage of the multiple cores on the machine and run somewhat faster.

Table 6: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.

> data(bfi)
> rpoly <- polychoric(bfi[1:25])  # the ... indicate the progress of the function

A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.

5.2 Confidence intervals using bootstrapping techniques

Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ2 and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (See Table 8). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.
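A sketch of such a bootstrapped analysis on a polychoric matrix (assuming the cor = "poly" option of fa; 20 iterations, as in Table 8):

> fb <- fa(bfi[1:10], 2, n.iter = 20, cor = "poly")  # bootstrapped CIs from polychoric correlations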

5.3 Comparing factor/component/cluster solutions

Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of the congruence coefficient, which is just the cosine of the angle between the dimensions:

\[ c_{f_i f_j} = \frac{\sum_{k=1}^{n} f_{ik} f_{jk}}{\sqrt{\sum f_{ik}^2 \sum f_{jk}^2}} \]

Consider the case of a four factor solution and four cluster solution to the Big Five problem.

> f4 <- fa(bfi[1:25], 4, fm = "pa")
> factor.congruence(f4, ic)

      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12


> ic.poly <- iclust(rpoly$rho, title = "ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)

[ICLUST dendrogram titled "ICLUST using polychoric correlations": all items join in a root cluster C23 (α = 0.83, β = 0.58), with subclusters C22 (α = 0.81, β = 0.29), C21 (α = 0.48, β = 0.35), C20 (α = 0.83, β = 0.66), C19 (α = 0.8, β = 0.66), C16 (α = 0.84, β = 0.79) and C15 (α = 0.77, β = 0.71)]

Figure 9: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 8), which was done using Pearson correlations.


> ic.poly <- iclust(rpoly$rho, 5, title = "ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)

[ICLUST dendrogram titled "ICLUST using polychoric correlations for nclusters=5": clusters C20 (α = 0.83, β = 0.66), C19 (α = 0.8, β = 0.66), C16 (α = 0.84, β = 0.79) and C15 (α = 0.77, β = 0.71), plus a small openness cluster (C7: O2 and O5), with O4 standing alone]

Figure 10: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 9), which was done without specifying the number of clusters, and to the next one (Figure 11), which was done by changing the beta criterion.


> ic.poly <- iclust(rpoly$rho, beta.size = 3, title = "ICLUST beta.size=3")

[ICLUST dendrogram titled "ICLUST beta.size=3": the same tree as Figure 9, ending in the root cluster C23 (α = 0.83, β = 0.58)]

Figure 11: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 8, 9, 10).


Table 7: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.0.

> print(ic, cut = .3)
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
    O   P     C20   C16   C15   C21
A1 C20 C20
A2 C20 C20   0.59
A3 C20 C20   0.65
A4 C20 C20   0.43
A5 C20 C20   0.65
C1 C15 C15               0.54
C2 C15 C15               0.62
C3 C15 C15               0.54
C4 C15 C15         0.31 -0.66
C5 C15 C15  -0.30  0.36 -0.59
E1 C20 C20  -0.50
E2 C20 C20  -0.61  0.34
E3 C20 C20   0.59             -0.39
E4 C20 C20   0.66
E5 C20 C20   0.50        0.40 -0.32
N1 C16 C16         0.76
N2 C16 C16         0.75
N3 C16 C16         0.74
N4 C16 C16  -0.34  0.62
N5 C16 C16         0.55
O1 C20 C21                    -0.53
O2 C21 C21                     0.44
O3 C20 C21   0.39             -0.62
O4 C21 C21                    -0.33
O5 C21 C21                     0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit = 0.68   Pattern fit = 0.96   RMSR = 0.05
NULL


Table 8: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.

> fa(bfi[1:10], 2, n.iter = 20)
Factor Analysis with confidence intervals using method =  fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method =  minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

 With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with MLE Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.078 and the 90% confidence intervals are 0.071 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

 Coefficients and bootstrapped confidence intervals
    low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.46 -0.40 -0.36
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.06 -0.03  0.00  0.74  0.77  0.80
A4  0.10  0.15  0.18  0.41  0.44  0.48
A5 -0.02  0.02  0.07  0.58  0.62  0.66
C1  0.52  0.57  0.60 -0.08 -0.05 -0.02
C2  0.59  0.62  0.66 -0.04 -0.01  0.02
C3  0.50  0.54  0.57 -0.01  0.03  0.08
C4 -0.70 -0.66 -0.62 -0.02  0.01  0.03
C5 -0.62 -0.57 -0.53 -0.08 -0.05 -0.01

 Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.28     0.32  0.37
>


PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90

A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor and component solutions to the Thurstone data set is done using the factor.congruence function (See Table 9).

Table 9: Congruence coefficients for oblique factor, bifactor and component solutions for the Thurstone problem.

> factor.congruence(list(f3t, f3o, om, p3p))

      MR1  MR2  MR3   PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2  RC3
MR1  1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
MR2  0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
MR3  0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
PA1  1.00 0.03 0.01  1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2  0.06 1.00 0.01  0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99 0.05
PA3  0.13 0.06 1.00  0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01 0.99
g    0.72 0.60 0.52  0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58 0.50
F1   1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
F2   0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
F3   0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
h2   0.74 0.57 0.51  0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56 0.49
RC1  1.00 0.04 0.06  1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06 0.00
RC2  0.08 0.99 0.02  0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00 0.05
RC3  0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05 1.00

5.4 Determining the number of dimensions to extract

How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).

Techniques most commonly used include

1) Extracting factors until the chi square of the residual matrix is not significant

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis), fa.parallel (Horn, 1965).

4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).

5) Extracting factors as long as they are interpretable

6) Using the Very Simple Structure Criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.

Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples (N > 1,000) all of the eigen values of random factors will be very close to 1. The scree test is quite appealing but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigators creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.
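Several of these criteria may be compared in a single run using the nfactors function mentioned earlier (a sketch):

> nfactors(bfi[1:25])  # runs and summarizes a number of the tests listed above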

An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest but that account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function, as sketched below.
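A sketch of simulating such a structure (assuming sim.minor returns the simulated raw scores in its observed element when n > 0):

> set.seed(42)                                    # hypothetical seed
> sm <- sim.minor(nvar = 12, nfact = 3, n = 500)  # a few major factors plus many minor ones
> fa.parallel(sm$observed)                        # how many factors are suggested?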

5.4.1 Very Simple Structure

The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Average Partial correlation) of Velicer (1976).

Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 12). However, the MAP criterion suggests that 5 is optimal.


> vss <- vss(bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")

[Plot titled "Very Simple Structure of a Big 5 inventory": Very Simple Structure Fit (0.0 to 1.0) against the Number of Factors (1 to 8), with separate curves for complexity 1 through 4; the complexity 1 and 2 curves peak at four factors]

Figure 12: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.


> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximimum of 0.58 with 4 factors
VSS complexity 2 achieves a maximimum of 0.74 with 4 factors

The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit RMSEA  BIC   SABIC complex eChisq  SRMR eCRMS  eBIC
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.123 9648 10522.1     1.0  23881 0.119 0.125 21698
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.100 5287  6084.5     1.2  12432 0.086 0.094 10440
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.087 3200  3924.3     1.3   7232 0.066 0.075  5422
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.074 1731  2385.1     1.4   3750 0.047 0.057  2115
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.055  281   869.3     1.6   1495 0.030 0.038    27
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.043 -296   228.5     1.7    670 0.020 0.027  -639
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.037 -463     1.2     1.9    448 0.016 0.023  -711
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.032 -524  -117.6     1.9    289 0.013 0.020  -727

5.4.2 Parallel Analysis

An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 13). It is interesting to compare fa.parallel with the paran from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial minor (but not zero) factors, show that using smcs will tend to lead to too many factors.

A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly or fa.parallel with the cor option="poly". By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
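A sketch of the polychoric version (considerably slower than the Pearson analysis):

> fa.parallel(bfi[1:25], cor = "poly", n.iter = 20)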

5.5 Factor extension

Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created


> fa.parallel(bfi[1:25], main = "Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors = 6 and the number of components = 6

[Scree-style plot titled "Parallel Analysis of a Big 5 inventory": eigen values of principal components and factor analysis against factor/component number, with lines for PC and FA of the actual, simulated and resampled data]

Figure 13: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.


to represent a two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (See Figure 14).

Another way to examine the overlap between two sets is the use of set correlation found by setCor (discussed later).

6 Classical Test Theory and Reliability

Surprisingly, 112 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and over estimates the first factor saturation (Revelle and Zinbarg, 2009).

α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

\[ \lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(V_x)}{V_x}\right) = \frac{n}{n-1}\,\frac{V_x - tr(V_x)}{V_x} = \alpha \]

Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. For τ equivalent items, alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy"), coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.

Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, $e_j^2$, and is

\[ \lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum (1 - r_{smc}^2)}{V_x} \]

The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.

G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability


> v16 <- sim.item(16)
> s <- c(1, 3, 5, 7, 9, 11, 13, 15)
> f2 <- fa(v16[, s], 2)
> fe <- fa.extension(cor(v16)[s, -s], f2)
> fa.diagram(f2, fe = fe)

[Path diagram titled "Factor analysis and extension": MR1 and MR2 are defined by the odd-numbered variables (loadings ±0.5 to 0.6) and extended to the even-numbered variables (loadings ±0.5 to 0.7)]

Figure 14: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.


the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.

Alpha, G6 and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.

More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.

Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.

Functions for examining the reliability of a single scale or a set of scales include

alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item - whole correlations, α if an item is omitted, and item means and standard deviations.

guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zergers (1978) (μ0 … μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).

omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)

cluster.cor Given a n x c cluster definition matrix of -1s, 0s, and 1s (the keys) and a n x n correlation matrix, find the correlations of the composite clusters.

scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1) and a matrix or data.frame of items scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha, coefficient G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items (see the sketch following this list).

score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.

6.1 Reliability of a single scale

A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, McDonald's ωh and ωt. Consider a simulated data set representing 9 items with a hierarchical structure and the following correlation matrix. Then using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as if one item is dropped at a time.

> set.seed(17)
> r9 <- sim.hierarchical(n = 500, raw = TRUE)$observed
> round(cor(r9), 2)

     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00

> alpha(r9)

Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
  0.77   0.8  0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97

Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reversed scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that item 2 is to be scored (incorrectly) negatively.

> keys <- c(1, -1, 1, 1, 1, 1, 1)
> alpha(attitude, keys)

Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
       0.43      0.52    0.75      0.14 1.1 0.13   58 5.5

 lower alpha upper     95% confidence boundaries
  0.18  0.43  0.68

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r  S/N alpha se
rating           0.32      0.44    0.67     0.114 0.77    0.170
complaints-      0.80      0.80    0.82     0.394 3.89    0.057
privileges       0.27      0.41    0.72     0.103 0.69    0.176
learning         0.14      0.31    0.64     0.069 0.44    0.207
raises           0.14      0.27    0.61     0.059 0.38    0.205
critical         0.36      0.47    0.76     0.130 0.90    0.139
advance          0.21      0.34    0.66     0.079 0.51    0.171

 Item statistics
             n raw.r std.r r.cor r.drop mean   sd
rating      30  0.60  0.60  0.60   0.33   65 12.2
complaints- 30 -0.57 -0.58 -0.74  -0.74   50 13.3
privileges  30  0.66  0.65  0.54   0.42   53 12.2
learning    30  0.80  0.79  0.78   0.64   56 11.7
raises      30  0.82  0.83  0.85   0.69   65 10.4
critical    30  0.50  0.53  0.35   0.27   75  9.9
advance     30  0.74  0.75  0.71   0.58   43 10.3

Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if item 2 is dropped, then the reliability is improved substantially. This suggests that item 2 was incorrectly scored. Doing the analysis again with item 2 positively scored produces much more favorable results.

> keys <- c(1, 1, 1, 1, 1, 1, 1)
> alpha(attitude, keys)

Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
  0.76  0.84  0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3

It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.

> items <- sim.congeneric(N = 500, short = FALSE, low = -2, high = 2, categorical = TRUE)  # 500 responses to 4 discrete items
> alpha(items$observed)  # item response analysis of congeneric measures

Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
  0.68  0.72  0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02

Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0

6.2 Using omega to find the reliability of a single scale

Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (See Figure 15) or omegaSem for a confirmatory analysis using the sem package (Fox et al., 2013) based upon the exploratory solution from omega.

McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009) http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf )

One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.

ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal). For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively, and some negatively. Although probably better to specify which directions the items are to be scored by specifying a key vector, if flip = TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)
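For example, to use principal axes rather than the minres default (a sketch, using the r9 data introduced above):

> om.pa <- omega(r9, fm = "pa")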

β, an alternative to ωh, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).

The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).

Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.

There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be the $\sqrt{r_{ab}}$ where $r_{ab}$ is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. (2007). To do this in omega, add the option="first" or option="second" to the call.
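A sketch of a two factor call using one of these options:

> om2 <- omega(r9, 2, option = "first")  # equate the general factor with the first group factor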

Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.

In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.

McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness $u^2$ from factor analysis to find $e_j^2$. This is based on a decomposition of the variance of a test score, $V_x$, into four parts: that due to a general factor, g, that due to a set of group factors, f (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)

Letting $x = cg + Af + Ds + e$, then the communality of item j, based upon general as well as group factors, is $h_j^2 = c_j^2 + \sum f_{ij}^2$, and the unique variance for the item, $u_j^2 = \sigma_j^2 (1 - h_j^2)$, may be used to estimate the test reliability. That is, if $h_j^2$ is the communality of item j, based upon general as well as group factors, then for standardized items, $e_j^2 = 1 - h_j^2$ and

\[ \omega_t = \frac{1cc'1 + 1AA'1'}{V_x} = 1 - \frac{\sum (1 - h_j^2)}{V_x} = 1 - \frac{\sum u^2}{V_x} \]


Because $h_j^2 \geq r_{smc}^2$, $\omega_t \geq \lambda_6$.

It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor.

\[ \omega_h = \frac{1cc'1}{V_x} \]

Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

\[ \omega_{\infty} = \frac{1cc'1}{1cc'1 + 1AA'1'} \]

In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.

> om9

9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57           0.41 0.50 0.50 0.66
V5 0.55           0.32 0.41 0.59 0.73
V6 0.43           0.27 0.28 0.72 0.66
V7 0.47      0.47      0.44 0.56 0.50
V8 0.43      0.42      0.36 0.64 0.51
V9 0.32      0.23      0.16 0.84 0.63

With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general = 0.64    with sd = 0.08 and cv of 0.13
Explained Common Variance of the general factor = 0.65

The degrees of freedom are 12 and the fit is 0.03
The number of observations was 500 with Chi Square = 15.86 with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03

50

> om9 <- omega(r9,title="9 simulated variables")

[Bifactor path diagram omitted: g loads all nine variables (0.3 to 0.7), with group factors F1* (V1, V2, V3), F2* (V7, V8, V9), and F3* (V4, V5, V6).]

Figure 15: A bifactor solution for 9 simulated variables with a hierarchical structure.


RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                  g   F1*   F2*   F3*
Correlation of scores with factors             0.84  0.59  0.61  0.54
Multiple R square of scores with factors       0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates  0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

6.3 Estimating ωh using Confirmatory Factor Analysis

The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.

> omegaSem(r9,n.obs=500)

Call: omegaSem(m = r9, n.obs = 500)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than  0.2
      g   F1*   F2*   F3*   h2   u2   p2
V1 0.72  0.45             0.72 0.28 0.71
V2 0.58  0.37             0.47 0.53 0.71
V3 0.57  0.41             0.49 0.51 0.66
V4 0.57              0.41 0.50 0.50 0.66
V5 0.55              0.32 0.41 0.59 0.73
V6 0.43              0.27 0.28 0.72 0.66
V7 0.47        0.47       0.44 0.56 0.50
V8 0.43        0.42       0.36 0.64 0.51
V9 0.32        0.23       0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03
RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                  g   F1*   F2*   F3*
Correlation of scores with factors             0.84  0.59  0.61  0.54
Multiple R square of scores with factors       0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates  0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

Omega Hierarchical from a confirmatory model using sem =  0.72
Omega Total from a confirmatory model using sem =  0.83
With loadings of
      g   F1*  F2*  F3*   h2   u2
V1 0.74  0.40           0.71 0.29
V2 0.58  0.36           0.47 0.53
V3 0.56  0.42           0.50 0.50
V4 0.57            0.45 0.53 0.47
V5 0.58            0.25 0.40 0.60
V6 0.43            0.26 0.26 0.74
V7 0.49       0.38      0.38 0.62
V8 0.44       0.45      0.39 0.61
V9 0.34       0.24      0.17 0.83

With eigenvalues of:
   g  F1*  F2*  F3*
2.60 0.47 0.40 0.33


6.3.1 Other estimates of reliability

Other estimates of reliability are found by the splitHalf function. These are described in more detail in Revelle and Zinbarg (2009). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.

> splitHalf(r9)

Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) =  0.84
Guttman lambda 6                          =  0.8
Average split half reliability            =  0.79
Guttman lambda 3 (alpha)                  =  0.8
Minimum split half reliability  (beta)    =  0.72

6.4 Reliability and correlations of multiple scales within an inventory

A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness, and Openness). The data may either be the raw data or a correlation matrix (scoreItems), or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6.

6.4.1 Scoring from raw data

To score these five scales from the 25 items, use the scoreItems function with the helper function make.keys. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), and -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.

There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.
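As a minimal sketch, the totals argument of scoreItems is the relevant switch (keys and bfi are as defined below):

> scores.mean <- scoreItems(keys,bfi)              #item means, the default
> scores.total <- scoreItems(keys,bfi,totals=TRUE) #total item scores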

The printed output includes coefficients α and G6, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.

First, create the keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copied into R.) Although not normally necessary, show the keys to understand what is happening.

Note that the number of items to specify in the make.keys function is the total number of items in the inventory. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.

> keys <- make.keys(nvars=28,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+   Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+   Openness = c(21,-22,23,24,-25)),
+   item.labels=colnames(bfi))
> keys

          Agree Conscientious Extraversion Neuroticism Openness
A1           -1             0            0           0        0
A2            1             0            0           0        0
A3            1             0            0           0        0
A4            1             0            0           0        0
A5            1             0            0           0        0
C1            0             1            0           0        0
C2            0             1            0           0        0
C3            0             1            0           0        0
C4            0            -1            0           0        0
C5            0            -1            0           0        0
E1            0             0           -1           0        0
E2            0             0           -1           0        0
E3            0             0            1           0        0
E4            0             0            1           0        0
E5            0             0            1           0        0
N1            0             0            0           1        0
N2            0             0            0           1        0
N3            0             0            0           1        0
N4            0             0            0           1        0
N5            0             0            0           1        0
O1            0             0            0           0        1
O2            0             0            0           0       -1
O3            0             0            0           0        1
O4            0             0            0           0        1
O5            0             0            0           0       -1
gender        0             0            0           0        0
education     0             0            0           0        0
age           0             0            0           0        0

The use of multiple key matrices for different inventories is facilitated by using the superMatrix function to combine two or more matrices. This allows convenient scoring of large data sets combining multiple inventories with keys based upon each individual inventory. Pretend for the moment that the big 5 items were made up of two inventories, one consisting of the first 10 items, the second the last 18 items (15 personality items + 3 demographic items). Then the following code would work:

> keys.1 <- make.keys(10,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10)))
> keys.2 <- make.keys(15,list(Extraversion=c(-1,-2,3:5),Neuroticism=c(6:10),
+    Openness = c(11,-12,13,14,-15)))
> keys.25 <- superMatrix(list(keys.1,keys.2))

The resulting keys matrix is identical to that found above, except that it does not include the extra 3 demographic items. This is useful when scoring the raw items, because the response frequencies for each category are reported, including those for the demographic data.

This use of making multiple key matrices and then combining them into one super matrix of keys is particularly useful when combining demographic information with items to be scored. A set of demographic keys can be made and then these can be combined with the keys for the particular scales.

Now use these keys in combination with the raw data to score the items, calculate basic reliability and intercorrelations, and find the item-by-scale correlations for each item and each scale. By default, missing data are replaced by the median for that variable.

> scores <- scoreItems(keys,bfi)
> scores

Call: scoreItems(keys = keys, items = bfi)

(Unstandardized) Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
alpha   0.7          0.72         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
ASE   0.014         0.014        0.013       0.011    0.017

Average item correlation:
          Agree Conscientious Extraversion Neuroticism Openness
average.r  0.32          0.34         0.39        0.46     0.23

Guttman 6* reliability:
         Agree Conscientious Extraversion Neuroticism Openness
Lambda.6   0.7          0.72         0.76        0.81      0.6

Signal/Noise based upon av.r:
             Agree Conscientious Extraversion Neuroticism Openness
Signal/Noise   2.3           2.6          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, alpha on the diagonal
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.70          0.36         0.63      -0.245     0.23
Conscientious  0.26          0.72         0.35      -0.305     0.30
Extraversion   0.46          0.26         0.76      -0.284     0.32
Neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
Openness       0.15          0.19         0.22      -0.086     0.60

In order to see the item by scale loadings and frequency counts of the data
print with the short option = FALSE

To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores element of scores. (See Figure 16.)
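For instance, a minimal sketch (see ?scoreItems for the full list of returned elements):

> print(scores,short=FALSE)  #show item by scale correlations and frequency counts
> names(scores)              #list the components of the scoreItems object
> scores$scores[1:5,]        #the first five rows of individual scale scores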

6.4.2 Forming scales from a correlation matrix

There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.

Consider the same bfi data set, but first find the correlations and then use cluster.cor:

> r.bfi <- cor(bfi,use="pairwise")
> scales <- cluster.cor(keys,r.bfi)
> summary(scales)

Call: cluster.cor(keys = keys, r.mat = r.bfi)

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, (standardized) alpha on the diagonal
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.71          0.35         0.64      -0.242     0.25
Conscientious  0.25          0.73         0.36      -0.287     0.30
Extraversion   0.47          0.27         0.76      -0.278     0.35
Neuroticism   -0.18         -0.22        -0.22       0.815    -0.11
Openness       0.16          0.20         0.24      -0.074     0.61

To find the correlations of the items with each of the scales (the "structure" matrix) or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.
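A minimal sketch of that call, reusing the keys and r.bfi objects from above:

> cl <- cluster.loadings(keys,r.bfi)  #structure and pattern matrices from a correlation matrix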


> png('scores.png')
> pairs.panels(scores$scores,pch='.',jiggle=TRUE)
> dev.off()
quartz
     2

Figure 16: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.


6.5 Scoring Multiple Choice Items

Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice, where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).

> data(iqitems)
> iq.keys <- c(4,4,4,6, 6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys,iqitems)

Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

item statistics
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39

> #just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys,iqitems,score=FALSE)
> describe(iq.tf)  #compare to previous results

          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01

Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
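As a minimal sketch, with a purely illustrative grouping of the 16 items into four content scales:

> #hypothetical content keys for the 16 true/false items
> tf.keys <- make.keys(16,list(reason=1:4,letter=5:8,matrix=9:12,rotate=13:16),
+    item.labels=colnames(iq.tf))
> iq.scales <- scoreItems(tf.keys,iq.tf)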

6.6 Item analysis

Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or for many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or based upon correlation matrices using the setCor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.
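In outline, a sketch of that sequence for the 25 bfi personality items (the choice of 5 factors is purely illustrative):

> describe(bfi[1:25])      #basic descriptive statistics for the items
> fa.parallel(bfi[1:25])   #how many factors (parallel analysis)?
> f5 <- fa(bfi[1:25],5)    #extract 5 factors
> ic <- iclust(bfi[1:25])  #hierarchical cluster analysis of the items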

Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 17).

7 Item Response Theory analysis

The use of Item Response Theory is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 18).


> data(iqitems)
> iq.keys <- c(4,4,4,6, 6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys,iqitems,score=TRUE,short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  #set this to see the output for multiple items
> irt.responses(scores$scores,iqitems[1:4],breaks=11)

[Figure: four panels plotting P(response) against theta for reason.4, reason.16, reason.17, and reason.19, with separate curves for response alternatives 0-6.]

Figure 17: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.


7.1 Factor analysis and Item Response Theory

If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δj, and item discrimination, αj, may be found from factor analysis of the tetrachoric correlations, where λj is just the factor loading on the first factor and τj is the normal threshold reported by the tetrachoric function:

\[
\delta_j = \frac{D \tau_j}{\sqrt{1-\lambda_j^2}}, \qquad \alpha_j = \frac{\lambda_j}{\sqrt{1-\lambda_j^2}} \tag{2}
\]

where D is a scaling factor used when converting to the parameterization of the logistic model, and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λj) and item thresholds (τj) are just

\[
\lambda_j = \frac{\alpha_j}{\sqrt{1+\alpha_j^2}}, \qquad \tau_j = \frac{\delta_j}{\sqrt{1+\alpha_j^2}}.
\]

Consider 9 dichotomous items representing one factor but differing in their levels of difficulty.

> set.seed(17)
> d9 <- sim.irt(9,1000,-2.5,2.5,mod="normal")  #dichotomous items
> test <- irt.fa(d9$items)
> test

Item Response Analysis using Factor Analysis

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis

 Summary information by factor and item
 Factor = 1
              -3    -2    -1     0     1     2     3
V1          0.48  0.59  0.24  0.06  0.01  0.00  0.00
V2          0.30  0.68  0.45  0.12  0.02  0.00  0.00
V3          0.12  0.50  0.74  0.29  0.06  0.01  0.00
V4          0.05  0.26  0.74  0.55  0.14  0.03  0.00
V5          0.01  0.07  0.44  1.05  0.41  0.06  0.01
V6          0.00  0.03  0.15  0.60  0.74  0.24  0.04
V7          0.00  0.01  0.04  0.22  0.73  0.63  0.16
V8          0.00  0.00  0.02  0.12  0.45  0.69  0.31
V9          0.00  0.01  0.02  0.08  0.25  0.47  0.36
Test Info   0.98  2.14  2.85  3.09  2.81  2.14  0.89
SEM         1.01  0.68  0.59  0.57  0.60  0.68  1.06
Reliability -0.02  0.53  0.65  0.68  0.64  0.53 -0.12


 Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27  and the objective function was  1.2
The number of observations was  1000  with Chi Square =  1195.58  with prob <  7.3e-235

The root mean square of the residuals (RMSA) is  0.09
The df corrected root mean square of the residuals is  0.1

Tucker Lewis Index of factoring reliability =  0.699
RMSEA index =  0.209  and the 90 % confidence intervals are  0.198 0.218
BIC =  1009.07

Similar analyses can be done for polytomous items, such as those of the bfi extraversion scale:

> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis

 Summary information by factor and item
 Factor = 1
              -3    -2    -1     0     1     2     3
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02

 Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5  and the objective function was  0.05
The number of observations was  2800  with Chi Square =  135.92  with prob <  1.3e-27

The root mean square of the residuals (RMSA) is  0.04
The df corrected root mean square of the residuals is  0.06

Tucker Lewis Index of factoring reliability =  0.932
RMSEA index =  0.097  and the 90 % confidence intervals are  0.083 0.111
BIC =  96.23

The item information functions show that not all of the items are equally good (Figure 19):

These procedures can be generalized to more than one factor by specifying the number of


> op <- par(mfrow=c(3,1))
> plot(test,type="ICC")
> plot(test,type="IIC")
> plot(test,type="test")
> op <- par(mfrow=c(1,1))

[Figure: three panels for items V1-V9, plotted against the latent trait (normal scale): item characteristic curves ("Item parameters from factor analysis"), item information curves, and the total test information with its reliability.]

Figure 18: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.


> e.info <- plot(e.irt,type="IIC")

[Figure: item information curves for E1-E5 (E1 and E2 reverse keyed) plotted against the latent trait (normal scale).]

Figure 19: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.


factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut:

> print(e.info,sort=TRUE)

Item Response Analysis using Factor Analysis

 Summary information by factor and item
 Factor = 1
              -3    -2    -1     0     1     2     3
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02

More extensive IRT packages include ltm and eRm, and these should be used for serious Item Response Theory analysis.

7.2 Speeding up analyses

Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.

For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 21) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items.
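In outline (the tetrachoric matrix is saved as the rho element of the irt.fa output; the full runs are shown below):

> iq.irt <- irt.fa(iq.tf)    #slow: finds the tetrachoric correlations once
> om <- omega(iq.irt$rho,4)  #fast: reuses the saved correlation matrix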

> iq.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = iq.tf)
Item Response Analysis using Factor Analysis

 Summary information by factor and item
 Factor = 1
             -3    -2    -1     0     1     2     3
reason.4   0.04  0.20  0.59  0.59  0.19  0.04  0.01
reason.16  0.07  0.23  0.46  0.39  0.16  0.05  0.01
reason.17  0.06  0.27  0.69  0.50  0.14  0.03  0.01
reason.19  0.05  0.17  0.41  0.46  0.22  0.07  0.02


> iq.irt <- irt.fa(iq.tf)

[Figure: item information curves for the 16 ability items plotted against the latent trait (normal scale).]

Figure 20: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.


letter.7   0.04  0.16  0.44  0.52  0.23  0.06  0.02
letter.33  0.04  0.14  0.34  0.43  0.24  0.08  0.02
letter.34  0.04  0.17  0.48  0.54  0.22  0.06  0.01
letter.58  0.02  0.08  0.28  0.55  0.39  0.13  0.03
matrix.45  0.05  0.11  0.23  0.29  0.21  0.10  0.04
matrix.46  0.05  0.12  0.25  0.30  0.21  0.10  0.04
matrix.47  0.05  0.16  0.37  0.41  0.21  0.07  0.02
matrix.55  0.04  0.08  0.14  0.20  0.19  0.13  0.06
rotate.3   0.00  0.01  0.07  0.29  0.66  0.45  0.13
rotate.4   0.00  0.01  0.05  0.30  0.91  0.55  0.11
rotate.6   0.01  0.03  0.14  0.49  0.68  0.27  0.06
rotate.8   0.00  0.02  0.08  0.27  0.54  0.39  0.13
Test Info  0.55  1.95  4.99  6.53  5.41  2.57  0.71
SEM        1.34  0.72  0.45  0.39  0.43  0.62  1.18
Reliability -0.80 0.49  0.80  0.85  0.82  0.61 -0.40

 Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104  and the objective function was  1.85
The number of observations was  1525  with Chi Square =  2801.05  with prob <  0

The root mean square of the residuals (RMSA) is  0.08
The df corrected root mean square of the residuals is  0.09

Tucker Lewis Index of factoring reliability =  0.742
RMSEA index =  0.131  and the 90 % confidence intervals are  0.126 0.135
BIC =  2038.76

7.3 IRT based scoring

The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and those based upon IRT are practically identical (this may be seen in the examples for score.irt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items per person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The score.irt function applies basic IRT to this problem.

Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects:

> v9 <- sim.irt(9,1000,-2,2,mod="normal")  #dichotomous items
> items <- v9$items
> test <- irt.fa(items)


> om <- omega(iq.irt$rho,4)

[Omega diagram omitted: a general factor g loads all 16 items (roughly 0.4 to 0.7), with four group factors, the first defined by the rotation items.]

Figure 21: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when examining the item response functions.


> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> #now delete some of the data - note that they are ordered by score
> items[1:333,5:9] <- NA
> items[334:666,3:7] <- NA
> items[667:1000,1:4] <- NA
> scores <- score.irt(test,items)
> unitweighted <- score.irt(items=items,keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord],scores,unitweighted)
> colnames(scores.df) <- c("True theta","irt theta","total","fit","rasch","total","fit")

These results are seen in Figure 22.

8 Multilevel modeling

Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (over all descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.

8.1 Decomposing data into within and between level correlations using statsBy

There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.

This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:

\[
r_{xy} = \eta_{x_{wg}} \cdot \eta_{y_{wg}} \cdot r_{xy_{wg}} + \eta_{x_{bg}} \cdot \eta_{y_{bg}} \cdot r_{xy_{bg}} \tag{3}
\]


Figure 22: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported, and then the IRT and total scoring systems.
> pairs.panels(scores.df,pch='.',gap=0)
> title('Comparing true theta for IRT, Rasch and classically based scoring',line=3)

[Scatter plot matrix omitted: true theta, IRT theta, total score, Rasch score, and the fit statistics, with their pairwise correlations and distributions.]


where $r_{xy}$ is the normal correlation, which may be decomposed into a within group and a between group correlation, $r_{xy_{wg}}$ and $r_{xy_{bg}}$, and η (eta) is the correlation of the data with the within group values, or the group means.

8.2 Generating and displaying multilevel data

withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, and -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5 and V8, and V3, V6 and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6 and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5 and V6, and V7, V8 and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.

sim.multilevel will generate simulated data with a multilevel structure.

The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.

Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)) or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
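A minimal sketch of the first of these:

> sb <- statsBy(sat.act,group="education")  #within and between education levels
> summary(sb)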

9 Set Correlation and Multiple Regression from the correlation matrix

An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is

\[
R^2 = 1 - \prod_{i=1}^{n}(1 - \lambda_i)
\]

where λi is the ith eigenvalue of the eigenvalue decomposition of the matrix

\[
R = R_{xx}^{-1} R_{xy} R_{yy}^{-1} R_{yx}.
\]

Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic based upon the average canonical correlation might be more appropriate.

setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.

Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example, the analysis is done on the correlation matrix rather than the raw data.

> C <- cov(sat.act,use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)

Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared:  0.0272,  Adjusted R-squared:  0.02301
F-statistic: 6.487 on 3 and 696 DF,  p-value: 0.0002476

Compare this with the output from setCor:

> #compare with mat.regress
> setCor(c(4:6),c(1:3),C, n.obs=700)


Call: setCor(y = y, x = x, data = data, z = z, n.obs = n.obs, use = use,
    std = std, square = square, main = main, plot = plot)

Multiple Regression from matrix input

Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19

multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11

Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05

degrees of freedom of regression
[1]   3 696

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.050 0.033 0.008

Chisq of canonical correlations
[1] 35.8 23.1  5.6

 Average squared canonical correlation =  0.03
 Cohen's Set Correlation R2 =  0.09
Shrunken Set Correlation R2 =  0.08
F and df of Cohen's Set Correlation  7.26 9 1681.86
Unweighted correlation between the two sets =  0.01

Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same independent of the direction of the relationship.

10 Simulation functions

It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.

A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex, and sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, sim.anova for ANOVA and lm simulations, and sim.VSS. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.

sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.simple Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.

sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.

Some of these functions are described in more detail in the companion vignette: psych for sem.

The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.

Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.

Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.
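A minimal sketch of generating such a structure and then asking how many factors a conventional criterion suggests (argument values are just the defaults plus a sample size; with n > 0 the simulated sample data are returned in the observed element):

> set.seed(42)                                #for reproducibility
> minor <- sim.minor(nvar=12,nfact=3,n=500)   #3 major and 6 minor factors
> fa.parallel(minor$observed)                 #do the minor factors inflate the count?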

Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple, independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.


The four irt simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.

The logistic model is

\[
P(x|\theta_i,\delta_j,\gamma_j,\zeta_j) = \gamma_j + \frac{\zeta_j - \gamma_j}{1 + e^{\alpha_j(\delta_j - \theta_i)}} \tag{4}
\]

where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), αj is item discrimination and δj is item difficulty. For the 1 Parameter Logistic (Rasch) model, γ = 0, ζ = 1, α = 1, and item difficulty is the only free parameter to specify.

(Graphics of these may be seen in the demonstrations for the logistic function.)

The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same. With the a = α parameter = 1.702 in the logistic model, the two models are practically identical.
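A quick numerical check of that claim:

> theta <- seq(-3,3,.01)
> p.logistic <- 1/(1+exp(-1.702*theta))  #logistic with a = 1.702
> p.normal <- pnorm(theta)               #normal ogive
> max(abs(p.logistic - p.normal))        #less than .01 everywhere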

11 Graphical Displays

Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as specific diagram functions, e.g. (but not shown):

f3 <- fa(Thurstone,3)
plot(f3)
fa.diagram(f3)

c <- iclust(Thurstone)
plot(c)            #a pretty boring plot
iclust.diagram(c)  #a better diagram

c3 <- iclust(Thurstone,3)
plot(c3)           #a more interesting plot

data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)

ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)

The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in the dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.


> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA,xlim=xlim,ylim=ylim,main="Demonstration of dia functions",axes=FALSE,xlab="",ylab="")
> ul <- dia.rect(1,9,labels="upper left",xlim=xlim,ylim=ylim)
> ll <- dia.rect(1,3,labels="lower left",xlim=xlim,ylim=ylim)
> lr <- dia.ellipse(9,3,"lower right",xlim=xlim,ylim=ylim)
> ur <- dia.ellipse(9,9,"upper right",xlim=xlim,ylim=ylim)
> ml <- dia.ellipse(3,6,"middle left",xlim=xlim,ylim=ylim)
> mr <- dia.ellipse(7,6,"middle right",xlim=xlim,ylim=ylim)
> bl <- dia.ellipse(1,1,"bottom left",xlim=xlim,ylim=ylim)
> br <- dia.rect(9,1,"bottom right",xlim=xlim,ylim=ylim)
> dia.arrow(from=lr,to=ul,labels="right to left")
> dia.arrow(from=ul,to=ur,labels="left to right")
> dia.curved.arrow(from=lr,to=ll$right,labels="right to left")
> dia.curved.arrow(to=ur,from=ul$right,labels="left to right")
> dia.curve(ll$top,ul$bottom,"double")  #for rectangles, specify where to point
> dia.curved.arrow(mr,ur,"up")          #but for ellipses, just point to it
> dia.curve(ml,mr,"across")
> dia.arrow(ur,lr,"top down")
> dia.curved.arrow(br$top,lr$bottom,"up")
> dia.curved.arrow(bl,br,"left to right")
> dia.arrow(bl,ll$bottom)
> dia.curved.arrow(ml,ll$right)
> dia.curved.arrow(mr,lr$top)


Figure 23: The basic graphic capabilities of the dia functions are shown in this figure.

12 Miscellaneous functions

A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.

block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score.

geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headtail. Combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r and estimates effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also setCor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.

superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys, or when forming example problems.

13 Data sets

A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941) and were reanalyzed by Bechtoldt (1961). There are also personality item data representing five personality factors on 25 items (bfi), 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.

Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtholdt data sets are both 17 x 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.

iq 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton heights. peas is the data set Francis Galton used to introduce the correlation coefficient, with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1994), this data set allows for examples of basic scaling techniques.

14 Development version and a users guide

The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered ones). To download that version, go to the repository http://personality-project.org/r/src/contrib and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g. ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.

News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news:

> news(Version > "1.2.8",package="psych")


15 Psychometric Theory

The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book): An introduction to Psychometric Theory with Applications in R (Revelle, prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular, and R in general, consult http://personality-project.org/r/r.guide.html, A short guide to R.

16 SessionInfo

This document was prepared using the following settings:

> sessionInfo()

R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] psych_1.6.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4           lattice_0.20-33       matrixcalc_1.0-3      MASS_7.3-45
 [5] grid_3.3.0            arm_1.8-6             nlme_3.1-127          stats4_3.3.0
 [9] coda_0.18-1           minqa_1.2.4           mi_1.0                nloptr_1.0.4
[13] Matrix_1.2-6          boot_1.3-18           splines_3.3.0         lme4_1.1-12
[17] sem_3.1-7             tools_3.3.0           abind_1.4-3           GPArotation_2014.11-1
[21] parallel_3.3.0        mnormt_1.5-4

References

Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405–432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439–458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447–473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel Modeling in R (2.3): A Brief Introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245–276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd ed. edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78–98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297–334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173–178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England.

Fox, J., Nie, Z., and Byrnes, J. (2013). sem: Structural Equation Models. R package version 3.1-3.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430–450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.


Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255–282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121–132.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41–54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283–300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4):537–549.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231–258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309–317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153–175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676–1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481–495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill, New York, 3rd edition.

84

Pedhazur E (1997) Multiple regression in behavioral research explanation and predictionHarcourt Brace College Publishers

R Core Team (2017) R A Language and Environment for Statistical Computing RFoundation for Statistical Computing Vienna Austria

Revelle W (1979) Hierarchical cluster-analysis and the internal structure of tests Mul-tivariate Behavioral Research 14(1)57ndash74

Revelle W (2017) psych Procedures for Personality and Psychological Research North-western University Evanston httpcranr-projectorgwebpackagespsych R pack-age version 173

Revelle W (in prep) An introduction to psychometric theory with applications in RSpringer

Revelle W Condon D and Wilt J (2011) Methodological advances in differentialpsychology In Chamorro-Premuzic T Furnham A and von Stumm S editorsHandbook of Individual Differences chapter 2 pages 39ndash73 Wiley-Blackwell

Revelle W and Rocklin T (1979) Very Simple Structure - alternative procedure forestimating the optimal number of interpretable factors Multivariate Behavioral Research14(4)403ndash414

Revelle W Wilt J and Rosenthal A (2010) Individual differences in cognition Newmethods for examining the personality-cognition link In Gruszka A Matthews Gand Szymura B editors Handbook of Individual Differences in Cognition AttentionMemory and Executive Control chapter 2 pages 27ndash49 Springer New York NY

Revelle W and Zinbarg R E (2009) Coefficients alpha beta omega and the glbcomments on Sijtsma Psychometrika 74(1)145ndash154

Schmid J J and Leiman J M (1957) The development of hierarchical factor solutionsPsychometrika 22(1)83ndash90

Shrout P E and Fleiss J L (1979) Intraclass correlations Uses in assessing raterreliability Psychological Bulletin 86(2)420ndash428

Sneath P H A and Sokal R R (1973) Numerical taxonomy the principles and practiceof numerical classification A Series of books in biology W H Freeman San Francisco

Sokal R R and Sneath P H A (1963) Principles of numerical taxonomy A Series ofbooks in biology W H Freeman San Francisco

Spearman C (1904) The proof and measurement of association between two things TheAmerican Journal of Psychology 15(1)72ndash101

Thorburn W M (1918) The myth of Occamrsquos razor Mind 27345ndash353

85

Thurstone L L and Thurstone T G (1941) Factorial studies of intelligence TheUniversity of Chicago press Chicago Ill

Tryon R C (1935) A theory of psychological componentsndashan alternative to rdquomathematicalfactorsrdquo Psychological Review 42(5)425ndash454

Tryon R C (1939) Cluster analysis Edwards Brothers Ann Arbor Michigan

Velicer W (1976) Determining the number of components from the matrix of partialcorrelations Psychometrika 41(3)321ndash327

Zinbarg R E Revelle W Yovel I and Li W (2005) Cronbachrsquos α Revellersquos β andMcDonaldrsquos ωH Their relations with each other and two alternative conceptualizationsof reliability Psychometrika 70(1)123ndash133

Zinbarg R E Yovel I Revelle W and McDonald R P (2006) Estimating gener-alizability to a latent variable common to all of a scalersquos indicators A comparison ofestimators for ωh Applied Psychological Measurement 30(2)121ndash144

86


At this point you have had a chance to see the highlights of the psych package and to do some basic (and advanced) data analysis. You might find reading the entire overview vignette helpful to get a broader understanding of what can be done in R using psych. Remember that the help command (?) is available for every function. Try running the examples for each help page.

2 Overview of this and related documents

The psych package (Revelle, 2017) has been developed at Northwestern University since 2005 to include functions most useful for personality, psychometric, and psychological research. The package is also meant to supplement a text on psychometric theory (Revelle, in prep), a draft of which is available at http://personality-project.org/r/book/.

Some of the functions (e.g., read.file, read.clipboard, describe, pairs.panels, scatter.hist, error.bars, multi.hist, bi.bars) are useful for basic data entry and descriptive analyses.

Psychometric applications emphasize techniques for dimension reduction, including factor analysis, cluster analysis, and principal components analysis. The fa function includes five methods of factor analysis (minimum residual, principal axis, weighted least squares, generalized least squares, and maximum likelihood factor analysis). Determining the number of factors or components to extract may be done by using the Very Simple Structure (Revelle and Rocklin, 1979) (vss), Minimum Average Partial correlation (Velicer, 1976) (MAP), or parallel analysis (fa.parallel) criteria. Item Response Theory (IRT) models for dichotomous or polytomous items may be found by factoring tetrachoric or polychoric correlation matrices and expressing the resulting parameters in terms of location and discrimination using irt.fa. Bifactor and hierarchical factor structures may be estimated by using Schmid-Leiman transformations (Schmid and Leiman, 1957) (schmid) to transform a hierarchical factor structure into a bifactor solution (Holzinger and Swineford, 1937). Scale construction can be done using the Item Cluster Analysis (Revelle, 1979) (iclust) function to determine the structure and to calculate reliability coefficients α (Cronbach, 1951) (alpha, scoreItems, score.multiple.choice), β (Revelle, 1979; Revelle and Zinbarg, 2009) (iclust), and McDonald's ωh and ωt (McDonald, 1999) (omega). Guttman's six estimates of internal consistency reliability (Guttman, 1945), as well as additional estimates (Revelle and Zinbarg, 2009), are in the guttman function. The six measures of Intraclass correlation coefficients (ICC) discussed by Shrout and Fleiss (1979) are also available.

Graphical displays include Scatter Plot Matrix (SPLOM) plots using pairs.panels, correlation "heat maps" (cor.plot), factor, cluster, and structural diagrams using fa.diagram, iclust.diagram, and structure.diagram, as well as item response characteristics and item and test information characteristic curves using plot.irt and plot.poly.

3 Getting started

Some of the functions described in this overview require other packages. Particularly useful for rotating the results of factor analyses (from e.g., fa or principal) or hierarchical factor models using omega or schmid is the GPArotation package. These and other useful packages may be installed by first installing and then using the task views (ctv) package to install the "Psychometrics" task view, but doing it this way is not necessary.

4 Basic data analysis

A number of psych functions facilitate the entry of data and finding basic descriptive statistics.

Remember, to run any of the psych functions it is necessary to make the package active by using the library command:

R code
library(psych)

The other packages, once installed, will be called automatically by psych.

It is possible to automatically load psych and other functions by creating and then saving a ".First" function, e.g.,

R code
.First <- function(x) {library(psych)}

4.1 Getting the data by using read.file

Although many find copying the data to the clipboard and then using the read.clipboard functions (see below) to be convenient, a helpful alternative is to read the data in directly. This can be done using the read.file function, which calls file.choose to find the file and then, based upon the suffix of the file, chooses the appropriate way to read it. For files with suffixes of .txt, .text, .r, .rds, .rda, .csv, .xpt, or .sav, the file will be read correctly.

R code
my.data <- read.file()

If the file contains Fixed Width Format (fwf) data, the column information can be specified with the widths command.

R code
my.data <- read.file(widths = c(4, rep(1, 35)))  # read a file without a header row:
    # 36 fields, the first of which is 4 columns wide, the rest 1 column each

If the file is a RData file (with suffix of .RData, .Rda, .rda, .Rdata, or .rdata), the object will be loaded. Depending on what was stored, this might be several objects. If the file is a .sav file from SPSS, it will be read with the most useful default options (converting the file to a data.frame and converting character fields to numeric). Alternative options may be specified. If it is an export file from SAS (.xpt or .XPT), it will be read. .csv files (comma separated files), normal .txt or .text files, and .data or .dat files will be read as well. These are assumed to have a header row of variable labels (header=TRUE). If the data do not have a header row, you must specify read.file(header=FALSE).

To read SPSS files and to keep the value labels, specify use.value.labels=TRUE. Unfortunately, this will make it difficult to do any analyses, and you might want to read it again, converting the value labels to numeric values.

R code
my.spss <- read.file(use.value.labels=TRUE)  # this will keep the value labels for .sav files
my.data <- read.file()                       # just read the file and convert the value labels to numeric

You can quickView the my.spss object to see what it looks like, as well as the my.data object to see the numeric encoding.

R code
quickView(my.spss)  # to see the original encoding
quickView(my.data)  # to see the numeric encoding

4.2 Data input from the clipboard

There are of course many ways to enter data into R. Reading from a local file using read.file is perhaps the most preferred. However, many users will enter their data in a text editor or spreadsheet program and then want to copy and paste into R. This may be done by using read.table and specifying the input file as "clipboard" (PCs) or "pipe(pbpaste)" (Macs), as in the sketch below.
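For example, a minimal base R sketch of the read.table approach just described (each line assumes the data, with a header row, are already on the clipboard):

R code
my.data <- read.table("clipboard", header=TRUE)      # on a PC
my.data <- read.table(pipe("pbpaste"), header=TRUE)  # on a Mac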

Alternatively, the read.clipboard set of functions are perhaps more user friendly:

read.clipboard is the base function for reading data from the clipboard.

read.clipboard.csv for reading text that is comma delimited.

read.clipboard.tab for reading text that is tab delimited (e.g., copied directly from an Excel file).

read.clipboard.lower for reading input of a lower triangular matrix with or without a diagonal. The resulting object is a square matrix.

read.clipboard.upper for reading input of an upper triangular matrix.

read.clipboard.fwf for reading in fixed width fields (some very old data sets).

For example, given a data set copied to the clipboard from a spreadsheet, just enter the command:

R code
my.data <- read.clipboard()

This will work if every data field has a value and even missing data are given some values (e.g., NA or -999). If the data were entered in a spreadsheet and the missing values were just empty cells, then the data should be read in as tab delimited or by using the read.clipboard.tab function.

R code
my.data <- read.clipboard(sep="\t")  # define the tab option, or
my.tab.data <- read.clipboard.tab()  # just use the alternative function

For the case of data in fixed width fields (some old data sets tend to have this format), copy to the clipboard and then specify the width of each field (in the example below, the first variable is 5 columns, the second is 2 columns, the next 5 are 1 column each, and the last 4 are 3 columns each).

R code
my.data <- read.clipboard.fwf(widths=c(5, 2, rep(1, 5), rep(3, 4)))

4.3 Basic descriptive statistics

Once the data are read in, then describe will provide basic descriptive statistics arranged in a data frame format. Consider the data set sat.act, which includes data from 700 web based participants on 3 demographic variables and 3 ability measures.

describe reports means, standard deviations, medians, min, max, range, skew, kurtosis, and standard errors for integer or real data. Non-numeric data, although the statistics are meaningless, will be treated as if numeric (based upon the categorical coding of the data) and will be flagged with an *.

It is very important to describe your data before you continue on doing more complicated multivariate statistics. The problem of outliers and bad data can not be overemphasized.

> library(psych)
> data(sat.act)
> describe(sat.act)  # basic descriptive statistics

          vars   n   mean     sd median trimmed    mad min max range  skew kurtosis   se
gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61    -1.62 0.02
education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68    -0.07 0.05
age          3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64     2.42 0.36
ACT          4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66     0.53 0.18
SATV         5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64     0.33 4.27
SATQ         6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59    -0.02 4.41

4.4 Simple descriptive graphics

Graphic descriptions of data are very helpful both for understanding the data as well as communicating important results. Scatter Plot Matrices (SPLOMS) using the pairs.panels function are useful ways to look for strange effects involving outliers and non-linearities. error.bars.by will show group means with 95% confidence boundaries.
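For example, a minimal sketch using the sat.act data set described above (assuming the three ability measures are in columns 4 to 6 and gender is the grouping variable):

R code
data(sat.act)
error.bars.by(sat.act[4:6], sat.act$gender)  # ability means with 95% confidence limits, by gender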

4.4.1 Scatter Plot Matrices

Scatter Plot Matrices (SPLOMS) are very useful for describing the data. The pairs.panels function, adapted from the help menu for the pairs function, produces xy scatter plots of each pair of variables below the diagonal, shows the histogram of each variable on the diagonal, and shows the lowess locally fit regression line as well. An ellipse around the mean, with the axis length reflecting one standard deviation of the x and y variables, is also drawn. The x axis in each scatter plot represents the column variable, the y axis the row variable (Figure 1). When plotting many subjects, it is both faster and cleaner to set the plot character (pch) to be '.' (see Figure 1 for an example).

pairs.panels will show the pairwise scatter plots of all the variables as well as histograms, locally smoothed regressions, and the Pearson correlation. When plotting many data points (as in the case of the sat.act data), it is possible to specify that the plot character is a period to get a somewhat cleaner graphic.

4.4.2 Correlational structure

There are many ways to display correlations. Tabular displays are probably the most common. The output from the cor function in core R is a rectangular matrix. lowerMat will round this to (2) digits and then display as a lower off diagonal matrix. lowerCor calls cor with use='pairwise', method='pearson' as default values and returns (invisibly) the full correlation matrix and displays the lower off diagonal matrix.

> lowerCor(sat.act)

          gendr edctn age   ACT   SATV  SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64  1.00

When comparing results from two different groups, it is convenient to display them as one matrix, with the results from one group below the diagonal and the other group above the diagonal. Use lowerUpper to do this:

> female <- subset(sat.act, sat.act$gender==2)
> male <- subset(sat.act, sat.act$gender==1)
> lower <- lowerCor(male[-1])

> png('pairspanels.png')
> pairs.panels(sat.act, pch='.')
> dev.off()
null device
          1

Figure 1: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. The plot character was set to a period (pch='.') in order to make a cleaner graph.


          edctn age   ACT   SATV  SATQ
education  1.00
age        0.61  1.00
ACT        0.16  0.15  1.00
SATV       0.02 -0.06  0.61  1.00
SATQ       0.08  0.04  0.60  0.68  1.00

> upper <- lowerCor(female[-1])

          edctn age   ACT   SATV  SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08  1.00
SATV       0.07 -0.03  0.53  1.00
SATQ       0.03 -0.09  0.58  0.63  1.00

> both <- lowerUpper(lower, upper)
> round(both, 2)

          education   age  ACT  SATV  SATQ
education        NA  0.52 0.16  0.07  0.03
age            0.61    NA 0.08 -0.03 -0.09
ACT            0.16  0.15   NA  0.53  0.58
SATV           0.02 -0.06 0.61    NA  0.63
SATQ           0.08  0.04 0.60  0.68    NA

It is also possible to compare two matrices by taking their differences and displaying one (below the diagonal) and the difference of the second from the first above the diagonal:

> diffs <- lowerUpper(lower, upper, diff=TRUE)
> round(diffs, 2)

          education   age  ACT  SATV SATQ
education        NA  0.09 0.00 -0.05 0.05
age            0.61    NA 0.07 -0.03 0.13
ACT            0.16  0.15   NA  0.08 0.02
SATV           0.02 -0.06 0.61    NA 0.05
SATQ           0.08  0.04 0.60  0.68   NA

4.4.3 Heatmap displays of correlational structure

Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set, which has a clear 3 factor solution (Figure 2), or a simulated data set of 24 variables with a circumplex structure (Figure 3). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.


> png('corplot.png')
> cor.plot(Thurstone, numbers=TRUE, main="9 cognitive variables from Thurstone")
> dev.off()
null device
          1

Figure 2: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option, the values are displayed as well.


> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> cor.plot(r.circ, main="24 variables in a circumplex")
> dev.off()
null device
          1

Figure 3: Using the cor.plot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect.


4.5 Polychoric, tetrachoric, polyserial, and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, the φ will underestimate the value of the Pearson applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process. A simple generalization of this to the case of multiple cuts is the polychoric correlation.

Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlation.

If the data are a mix of continuous, polytomous, and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.

The correlation matrix resulting from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will also happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigen values of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
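A minimal sketch of this workflow, assuming the dichotomous ability data set that is included with psych; the smoothing step is needed only when the resulting matrix is improper:

R code
data(ability)                     # dichotomous ability items bundled with psych
r.tet <- tetrachoric(ability)     # assumes a cut bivariate normal for each pair of items
r.fixed <- cor.smooth(r.tet$rho)  # adjust the eigen values if not positive semi-definite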

5 Item and scale analysis

The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.

5.1 Dimension reduction through factor analysis and cluster analysis

Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity¹. The goal for parsimony is seen in psychometrics as an attempt either to describe (components)

¹ Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.


or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs². At the most simple, a set of items can be thought to represent a random sample from one underlying domain, or perhaps a small set of domains. The question for the psychometrician is how many domains are represented, and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.

² Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.

At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components, or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores, or to estimate (for factors) factor scores.
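As a small base R illustration of the SVD view (using the complete cases of the sat.act ability measures; the standardized data matrix is decomposed as X = UDV′, and UD gives the unrotated component scores):

R code
X <- scale(na.omit(sat.act[4:6]))  # standardize the three ability measures
sv <- svd(X)                       # X = U D V'
pcs <- sv$u %*% diag(sv$d)         # unrotated principal component scores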

Several of the functions in psych address the problem of data reduction:

fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis, and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will eventually be phased out.

principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values.

factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the square root of the product of the sums of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.

vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ2 estimate.

fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.

fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

nfactors A number of different tests for the number of factors problem are run.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.

5.1.1 Minimum Residual Factor Analysis

The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix of n variables be approximated by the product of an n × k factor matrix F and its transpose plus a diagonal matrix of uniquenesses?

R = FF′ + U²     (1)

The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls", and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function, specifying the appropriate method.
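For example, the calls differ only in the fm argument (a sketch using the Thurstone correlation matrix analyzed below):

R code
f3.min <- fa(Thurstone, 3, n.obs=213)            # minres, the default
f3.pa  <- fa(Thurstone, 3, n.obs=213, fm="pa")   # principal axis
f3.wls <- fa(Thurstone, 3, n.obs=213, fm="wls")  # weighted least squares
f3.ml  <- fa(Thurstone, 3, n.obs=213, fm="ml")   # maximum likelihood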

factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.

A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 1). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ", and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation, although prior versions defaulted to varimax. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).
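A sketch of requesting alternative transformations through the rotate option:

R code
f3.var <- fa(Thurstone, 3, n.obs=213, rotate="varimax")  # an orthogonal rotation
f3.geo <- fa(Thurstone, 3, n.obs=213, rotate="geominQ")  # an oblique transformation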

5.1.2 Principal Axis Factor Analysis

An alternative least squares algorithm, factor.pa (incorporated into fa as an option, fm="pa"), does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration; setting the max.iter parameter to 1 produces the SAS solution.

The solutions from the fa, factor.minres, and factor.pa as well as the principal functions can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations are varimax and quartimax. Oblique transformations include oblimin, quartimin, and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.

Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 2).

5.1.3 Weighted Least Squares Factor Analysis

Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller


Table 1: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.

> f3t <- fa(Thurstone, 3, n.obs=213)
> f3t
Factor Analysis using method = minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix

                  MR1   MR2   MR3   h2   u2 com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.0
First.Letters    0.00  0.86  0.00 0.73 0.27 1.0
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

 With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63


Table 2: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.

> f3 <- fa(Thurstone, 3, n.obs = 213, fm="pa")
> f3o <- target.rot(f3)
> f3o
Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix

                  PA1   PA2   PA3   h2   u2
Sentences        0.89 -0.03  0.07 0.81 0.19
Vocabulary       0.89  0.07  0.00 0.80 0.20
Sent.Completion  0.83  0.04  0.03 0.70 0.30
First.Letters   -0.02  0.85 -0.01 0.73 0.27
4.Letter.Words  -0.05  0.74  0.09 0.57 0.43
Suffixes         0.17  0.63 -0.09 0.43 0.57
Letter.Series   -0.06 -0.08  0.84 0.69 0.31
Pedigrees        0.33 -0.09  0.48 0.37 0.63
Letter.Group    -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00


overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 3).

The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 4). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 4) or by a structure diagram (Figure 5).

> plot(f3t)


Figure 4: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.

A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but

Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.

> f3w <- fa(Thurstone, 3, n.obs = 213, fm="wls")
> print(f3w, cut=0, digits=3)
Factor Analysis using method = wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix

                  WLS1   WLS2   WLS3    h2    u2  com
Sentences        0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary       0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion  0.833  0.034  0.007 0.735 0.265 1.00
First.Letters   -0.002  0.855  0.003 0.731 0.269 1.00
4.Letter.Words  -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes         0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series    0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees        0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group    -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

 With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627


> fa.diagram(f3t)


Figure 5: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.


that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.

5.1.4 Principal Components analysis (PCA)

An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Psychologists typically use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 4 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. (Factor loadings and PCA loadings converge as the number of items increases; factor loadings may be thought of as the asymptotic principal component loadings as the number of items on the component grows towards infinity.) Also note how the goodness of fit statistics, based upon the residual off diagonal elements, are much worse than the fa solution.

5.1.5 Hierarchical and bi-factor solutions

For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).

Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 5) can themselves be factored. This results in a higher order factor model (Figure 6). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 7). Yet


Table 4: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 2. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.

> p3p <- principal(Thurstone, 3, n.obs = 213, rotate="Promax")
> p3p
Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix

                  RC1   RC2   RC3   h2   u2 com
Sentences        0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary       0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion  0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters    0.01  0.84  0.07 0.78 0.22 1.0
4.Letter.Words  -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes         0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series    0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees        0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group    -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

 With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
 with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98


another solution is to use structural equation modeling to directly solve for the general and group factors.

> om.h <- omega(Thurstone, n.obs=213, sl=FALSE)
> op <- par(mfrow=c(1,1))


Figure 6: A higher order factor solution to the Thurstone 9 variable problem.

Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011). Unfortunately, this rotation will frequently achieve a local rather than global optimum, and it is recommended to try multiple original configurations.
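A minimal sketch of this direct rotation approach (here requesting four factors so that one may serve as the general factor; given the local optimum problem just noted, it is worth re-running from different starting configurations):

R code
f.bif <- fa(Thurstone, 4, n.obs=213, rotate="bifactor")  # Jennrich-Bentler orthogonal bifactor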


> om <- omega(Thurstone, n.obs=213)


Figure 7: A bifactor solution to the Thurstone 9 variable problem.


5.1.6 Item Cluster Analysis: iclust

An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) review why the iclust algorithm is particularly appropriate for scale construction in marketing.

Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust, and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979). The algorithm proceeds as follows:

1. Find the proximity (e.g., correlation) matrix.

2. Identify the most similar pair of items.

3. Combine this most similar pair of items to form a new variable (cluster).

4. Find the similarity of this cluster to all other items and clusters.

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains or, in iclust, if there is a failure to increase reliability coefficients α or β).

6. Purify the solution by reassigning items to the most similar cluster center.

iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori, or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 8).


Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 7). The pattern matrix is available as an object in the results.

> data(bfi)
> ic <- iclust(bfi[1:25])


Figure 8: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.

The previous analysis (Figure 8) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 9). Note that finding the polychoric correlations takes some time the first time, but the next three analyses were done using that correlation matrix (rpoly$rho). When using the console for input, polychoric will report on its progress while working, using


Table 5: The summary statistics from an iclust analysis show three large clusters and one smaller cluster.

> summary(ic)  # show the results
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*:
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations: reliabilities on diagonal, correlations corrected for attenuation above diagonal:
      C20    C16    C15   C21
C20  0.80 -0.291  0.40 -0.33
C16 -0.24  0.815 -0.29  0.11
C15  0.30 -0.221  0.73 -0.30
C21 -0.23  0.074 -0.20  0.61


the progressBar function. polychoric and tetrachoric, when running on OSX, will take advantage of the multiple cores on the machine and run somewhat faster.

Table 6: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.

> data(bfi)
> rpoly <- polychoric(bfi[1:25])  # the dots indicate the progress of the function

A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.

5.2 Confidence intervals using bootstrapping techniques

Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ2 and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 8). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.
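For the polychoric case, a sketch using the raw bfi items (the cor="poly" option asks fa to first find the polychoric correlations before factoring and bootstrapping):

R code
f2.poly <- fa(bfi[1:10], 2, n.iter=20, cor="poly")  # bootstrapped CIs from polychoric correlations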

5.3 Comparing factor/component/cluster solutions

Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of the congruence coefficient, which is just the cosine of the angle between the dimensions:

c_{f_i f_j} = Σₖ f_{ik} f_{jk} / √( Σₖ f²_{ik} · Σₖ f²_{jk} )

Consider the case of a four factor solution and a four cluster solution to the Big Five problem.

> f4 <- fa(bfi[1:25], 4, fm="pa")
> factor.congruence(f4, ic)

      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12
PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90


> ic.poly <- iclust(rpoly$rho, title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)


Figure 9: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 8), which was done using Pearson correlations.


> ic.poly <- iclust(rpoly$rho, 5, title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)


Figure 10: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 9), which was done without specifying the number of clusters, and to the next one (Figure 11), which was done by changing the beta criterion.


> ic.poly <- iclust(rpoly$rho, beta.size=3, title="ICLUST, beta.size=3")


Figure 11: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 8, 9, 10).


Table 7: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.

> print(ic, cut=.3)
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
     O    P    C20   C16   C15   C21
A1  C20  C20
A2  C20  C20  0.59
A3  C20  C20  0.65
A4  C20  C20  0.43
A5  C20  C20  0.65
C1  C15  C15               0.54
C2  C15  C15               0.62
C3  C15  C15               0.54
C4  C15  C15         0.31 -0.66
C5  C15  C15 -0.30   0.36 -0.59
E1  C20  C20 -0.50
E2  C20  C20 -0.61   0.34
E3  C20  C20  0.59               -0.39
E4  C20  C20  0.66
E5  C20  C20  0.50         0.40  -0.32
N1  C16  C16         0.76
N2  C16  C16         0.75
N3  C16  C16         0.74
N4  C16  C16 -0.34   0.62
N5  C16  C16         0.55
O1  C20  C21                     -0.53
O2  C21  C21                      0.44
O3  C20  C21  0.39               -0.62
O4  C21  C21                     -0.33
O5  C21  C21                      0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations: reliabilities on diagonal, correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit = 0.68; Pattern fit = 0.96; RMSR = 0.05
NULL


Table 8: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.

> fa(bfi[1:10], 2, n.iter=20)
Factor Analysis with confidence intervals using method = fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method = minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix

     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

 With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with MLE Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.078 and the 90% confidence intervals are 0.071 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.46 -0.40 -0.36
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.06 -0.03  0.00  0.74  0.77  0.80
A4  0.10  0.15  0.18  0.41  0.44  0.48
A5 -0.02  0.02  0.07  0.58  0.62  0.66
C1  0.52  0.57  0.60 -0.08 -0.05 -0.02
C2  0.59  0.62  0.66 -0.04 -0.01  0.02
C3  0.50  0.54  0.57 -0.01  0.03  0.08
C4 -0.70 -0.66 -0.62 -0.02  0.01  0.03
C5 -0.62 -0.57 -0.53 -0.08 -0.05 -0.01

Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.28     0.32  0.37



A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor, and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 9).

Table 9: Congruence coefficients for oblique factor, bifactor, and component solutions for the Thurstone problem.

> factor.congruence(list(f3t, f3o, om, p3p))

     MR1  MR2  MR3  PA1  PA2  PA3    g   F1   F2   F3   h2   RC1  RC2   RC3
MR1 1.00 0.06 0.09 1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74  1.00 0.08  0.04
MR2 0.06 1.00 0.08 0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57  0.04 0.99  0.12
MR3 0.09 0.08 1.00 0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51  0.06 0.02  0.99
PA1 1.00 0.03 0.01 1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69  1.00 0.06 -0.04
PA2 0.06 1.00 0.01 0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54  0.04 0.99  0.05
PA3 0.13 0.06 1.00 0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53  0.10 0.01  0.99
g   0.72 0.60 0.52 0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99  0.69 0.58  0.50
F1  1.00 0.06 0.09 1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74  1.00 0.08  0.04
F2  0.06 1.00 0.08 0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57  0.04 0.99  0.12
F3  0.09 0.08 1.00 0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51  0.06 0.02  0.99
h2  0.74 0.57 0.51 0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00  0.71 0.56  0.49
RC1 1.00 0.04 0.06 1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71  1.00 0.06  0.00
RC2 0.08 0.99 0.02 0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56  0.06 1.00  0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05  1.00

5.4 Determining the number of dimensions to extract

How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).

Techniques most commonly used include

1) Extracting factors until the chi square of the residual matrix is not significant

2) Extracting factors until the change in chi square from factor n to factor n+1 is notsignificant

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis: fa.parallel) (Horn, 1965).

4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).

5) Extracting factors as long as they are interpretable

6) Using the Very Simple Structure Criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.

Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples (N > 1,000) all of the eigen values of random factors will be very close to 1. The scree test is quite appealing but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigator's creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2). The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.
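As an illustration, several of these criteria can be applied to the same data set and compared. The following sketch (not part of the original text, but using functions demonstrated elsewhere in this document) does so for the 25 bfi items:

> fa.parallel(bfi[1:25])   #criterion 3: compare real and random eigen values
> vss(bfi[1:25])           #criteria 6 and 7: VSS fit and Velicer's MAP
> scree(bfi[1:25])         #criterion 4: plot the successive eigen values

Agreement among several criteria is more persuasive than any single rule.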

An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest but that account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
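A hedged sketch of that simulation idea, assuming that sim.minor, like the other psych simulation functions, returns the simulated raw data in its observed element when n > 0:

> set.seed(42)
> minor <- sim.minor(nvar=12,nfact=3,n=500)   #3 major factors plus minor nuisance factors
> fa.parallel(minor$observed)                 #do the minor factors inflate the suggested count?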

5.4.1 Very Simple Structure

The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Average Partial correlation) of Velicer (1976).

Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 12). However, the MAP criterion suggests that 5 is optimal.


> vss <- vss(bfi[1:25],title="Very Simple Structure of a Big 5 inventory")

Figure 12: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.


> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximum of 0.58  with  4  factors
VSS complexity 2 achieves a maximum of 0.74  with  4  factors

The Velicer MAP achieves a minimum of 0.01  with  5  factors
BIC achieves a minimum of  -524.26  with  8  factors
Sample Size adjusted BIC achieves a minimum of  -117.56  with  8  factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit RMSEA  BIC   SABIC complex eChisq  SRMR eCRMS  eBIC
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.123 9648 10522.1     1.0  23881 0.119 0.125 21698
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.100 5287  6084.5     1.2  12432 0.086 0.094 10440
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.087 3200  3924.3     1.3   7232 0.066 0.075  5422
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.074 1731  2385.1     1.4   3750 0.047 0.057  2115
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.055  281   869.3     1.6   1495 0.030 0.038    27
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.043 -296   228.5     1.7    670 0.020 0.027  -639
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.037 -463     1.2     1.9    448 0.016 0.023  -711
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.032 -524  -117.6     1.9    289 0.013 0.020  -727

5.4.2 Parallel Analysis

An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 13). It is interesting to compare fa.parallel with the paran function from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial, minor (but not zero) factors, show that using smcs will tend to lead to too many factors.

A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly or fa.parallel with the cor option="poly". By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
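A minimal sketch of the two equivalent calls just described (the n.iter argument for the number of replications is an assumption based on the function documentation, not shown in the original text):

> fa.parallel(bfi[1:25],cor="poly")        #parallel analysis of polychoric correlations
> fa.parallel.poly(bfi[1:25],n.iter=20)    #the older function for the same analysis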

5.5 Factor extension

Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created


> fa.parallel(bfi[1:25],main="Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors =  6  and the number of components =  6

Figure 13: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.


to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 14).

Another way to examine the overlap between two sets is the use of set correlation found by setCor (discussed later).

6 Classical Test Theory and Reliability

Surprisingly, 112 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).

α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

$$\lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(V_x)}{V_x}\right) = \frac{n}{n-1}\,\frac{V_x - tr(V_x)}{V_x} = \alpha$$

Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. For τ equivalent items, alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy") coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.

Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, $e_j^2$, and is

$$\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum(1 - r_{smc}^2)}{V_x}$$

The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.
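As a hedged illustration of the λ6 formula, it can be applied directly to a correlation matrix. The sketch below assumes standardized items (so that Vx is just the sum of the correlation matrix), and uses sim.hierarchical (which, called with no arguments, returns a 9 variable population correlation matrix) and psych's smc function:

> R <- sim.hierarchical()                 #a 9 x 9 population correlation matrix
> lambda6 <- 1 - sum(1 - smc(R))/sum(R)   #1 - sum of error variances / total variance
> round(lambda6,2)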

G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability


> v16 <- sim.item(16)
> s <- c(1,3,5,7,9,11,13,15)
> f2 <- fa(v16[,s],2)
> fe <- fa.extension(cor(v16)[s,-s],f2)
> fa.diagram(f2,fe=fe)

Figure 14: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.


the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.
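A small sketch of the distinction between G6 and G6*, assuming pairwise complete correlations on the bfi items: the smcs of the first 10 items may be computed within just those 10 items, or (as G6* would) from all 25 items:

> round(smc(cor(bfi[1:10],use="pairwise")),2)         #smc within the 10 item subset
> round(smc(cor(bfi[1:25],use="pairwise"))[1:10],2)   #smc of the same items using all 25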

Alpha, G6 and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.
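To make the raw versus standardized distinction concrete, here is a hedged sketch that applies the λ3 formula once to a covariance matrix and once to the corresponding correlation matrix, using psych's tr function for the trace:

> C <- cov(attitude)                          #covariances give raw alpha
> n <- ncol(C)
> raw.alpha <- (n/(n-1))*(1 - tr(C)/sum(C))
> R <- cov2cor(C)                             #correlations give standardized alpha
> std.alpha <- (n/(n-1))*(1 - tr(R)/sum(R))
> round(c(raw.alpha,std.alpha),2)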

More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.

Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.

Functions for examining the reliability of a single scale or a set of scales include

alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item - whole correlations, α if an item is omitted, and item means and standard deviations.

guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zegers (1978) (μ0 ... μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).

omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)

cluster.cor Given an n x c cluster definition matrix of -1s, 0s, and 1s (the keys), and an n x n correlation matrix, find the correlations of the composite clusters.

scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of items scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha coefficient, G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.

score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.

6.1 Reliability of a single scale

A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, and McDonald's ωh and ωt. Consider a simulated data set, representing 9 items with a hierarchical structure and the following correlation matrix. Then using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as the estimates when one item is dropped at a time.

> set.seed(17)
> r9 <- sim.hierarchical(n=500,raw=TRUE)$observed
> round(cor(r9),2)

     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00

> alpha(r9)

Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
0.77 0.8 0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97

Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that item 2 is to be scored (incorrectly) negatively.

> keys <- c(1,-1,1,1,1,1,1)
> alpha(attitude,keys)

Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
       0.43      0.52    0.75      0.14 1.1 0.13   58 5.5

 lower alpha upper     95% confidence boundaries
0.18 0.43 0.68

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r  S/N alpha se
rating           0.32      0.44    0.67     0.114 0.77    0.170
complaints-      0.80      0.80    0.82     0.394 3.89    0.057
privileges       0.27      0.41    0.72     0.103 0.69    0.176
learning         0.14      0.31    0.64     0.069 0.44    0.207
raises           0.14      0.27    0.61     0.059 0.38    0.205
critical         0.36      0.47    0.76     0.130 0.90    0.139
advance          0.21      0.34    0.66     0.079 0.51    0.171

 Item statistics
             n raw.r std.r r.cor r.drop mean   sd
rating      30  0.60  0.60  0.60   0.33   65 12.2
complaints- 30 -0.57 -0.58 -0.74  -0.74   50 13.3
privileges  30  0.66  0.65  0.54   0.42   53 12.2
learning    30  0.80  0.79  0.78   0.64   56 11.7
raises      30  0.82  0.83  0.85   0.69   65 10.4
critical    30  0.50  0.53  0.35   0.27   75  9.9
advance     30  0.74  0.75  0.71   0.58   43 10.3

Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if item 2 is dropped then the reliability is improved substantially. This suggests that item 2 was incorrectly scored. Doing the analysis again with item 2 positively scored produces much more favorable results.

> keys <- c(1,1,1,1,1,1,1)
> alpha(attitude,keys)


Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
0.76 0.84 0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3

It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.

> items <- sim.congeneric(N=500,short=FALSE,low=-2,high=2,categorical=TRUE) #500 responses to 4 discrete items
> alpha(items$observed)  #item response analysis of congeneric measures

Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
0.68 0.72 0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02


Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0

6.2 Using omega to find the reliability of a single scale

Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 15) or omegaSem for a confirmatory analysis using the sem package (Fox et al., 2013) based upon the exploratory solution from omega.

McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) (http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf) compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009), http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf).

One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.
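A hedged sketch of that pipeline applied to the r9 data simulated above (R9 is a name introduced here; the assumptions are that the first column of the sl element returned by schmid holds the general factor loadings, and that for standardized items Vx is the sum of the correlation matrix):

> R9 <- cor(r9)                 #the item correlation matrix
> sl <- schmid(R9,3)            #oblique 3 factor solution followed by Schmid-Leiman
> g <- sl$sl[,1]                #loadings on the general factor
> omega.h <- sum(g)^2/sum(R9)   #1cc'1/Vx
> round(omega.h,2)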

ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal). For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively, and some negatively. Although probably better to specify which directions the items are to be scored by specifying a key vector, if flip = TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex).

β, an alternative to ωh, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).

The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).

Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.

There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be the $\sqrt{r_{ab}}$ where $r_{ab}$ is the oblique correlation between the factors) or to "first" or "second" in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. (2007). To do this in omega, add the option="first" or option="second" to the call.

Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.

In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.

McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness $u^2$ from factor analysis to find $e_j^2$. This is based on a decomposition of the variance of a test score, Vx, into four parts: that due to a general factor, g, that due to a set of group factors, f (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error).

Letting

$$x = cg + Af + Ds + e$$

then the communality of item j, based upon general as well as group factors, is

$$h_j^2 = c_j^2 + \sum f_{ij}^2$$

and the unique variance for the item,

$$u_j^2 = \sigma_j^2(1 - h_j^2),$$

may be used to estimate the test reliability. That is, if $h_j^2$ is the communality of item j, based upon general as well as group factors, then for standardized items, $e_j^2 = 1 - h_j^2$ and

$$\omega_t = \frac{\mathbf{1}cc'\mathbf{1} + \mathbf{1}AA'\mathbf{1}'}{V_x} = 1 - \frac{\sum(1-h_j^2)}{V_x} = 1 - \frac{\sum u_j^2}{V_x}$$
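Continuing the hedged sketch from above, the same Schmid-Leiman object can be used to apply this formula directly (assuming its sl element carries a u2 column, as the printed output below suggests):

> omega.t <- 1 - sum(sl$sl[,"u2"])/sum(R9)   #1 - sum(u2)/Vx
> round(omega.t,2)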


Because $h_j^2 \geq r_{smc}^2$, $\omega_t \geq \lambda_6$.

It is important to distinguish here between the two ω coefficients of McDonald (1978) and Equation 6.20a of McDonald (1999), ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor:

$$\omega_h = \frac{\mathbf{1}cc'\mathbf{1}}{V_x}$$

Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

$$\omega_{inf} = \frac{\mathbf{1}cc'\mathbf{1}}{\mathbf{1}cc'\mathbf{1} + \mathbf{1}AA'\mathbf{1}'}$$

In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.

> om.9

9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than  0.2
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03


> om.9 <- omega(r9,title="9 simulated variables")

Figure 15 A bifactor solution for 9 simulated variables with a hierarchical structure


RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

 Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

6.3 Estimating ωh using Confirmatory Factor Analysis

The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.

> omegaSem(r9,n.obs=500)

Call: omegaSem(m = r9, n.obs = 500)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than  0.2
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03
RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

 Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

 Omega Hierarchical from a confirmatory model using sem =  0.72
 Omega Total from a confirmatory model using sem =  0.83
With loadings of
      g  F1*  F2*  F3*   h2   u2
V1 0.74 0.40           0.71 0.29
V2 0.58 0.36           0.47 0.53
V3 0.56 0.42           0.50 0.50
V4 0.57      0.45      0.53 0.47
V5 0.58      0.25      0.40 0.60
V6 0.43      0.26      0.26 0.74
V7 0.49           0.38 0.38 0.62
V8 0.44           0.45 0.39 0.61
V9 0.34           0.24 0.17 0.83

With eigenvalues of:
   g  F1*  F2*  F3*
2.60 0.47 0.40 0.33


6.3.1 Other estimates of reliability

Other estimates of reliability are found by the splitHalf function. These are described in more detail in Revelle and Zinbarg (2009). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.

> splitHalf(r9)

Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) =  0.84
Guttman lambda 6                          =  0.8
Average split half reliability            =  0.79
Guttman lambda 3 (alpha)                  =  0.8
Minimum split half reliability (beta)     =  0.72

6.4 Reliability and correlations of multiple scales within an inventory

A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness and Openness). The data may either be the raw data or a correlation matrix (scoreItems), or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6*.

6.4.1 Scoring from raw data

To score these five scales from the 25 items, use the scoreItems function with the helper function make.keys. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), and -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.
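As a tiny worked example of that last point (illustrative numbers only): for an item scored on a 1 to 6 scale, reverse scoring subtracts each response from max + min = 7:

> x <- c(1,2,6)    #three responses on a 1 to 6 scale
> (6 + 1) - x      #reverse scored: 6 5 1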

There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.
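A short sketch of the two options; the totals argument of scoreItems switches between them (keys is the scoring matrix constructed below):

> means  <- scoreItems(keys,bfi)               #default: average item score per scale
> totals <- scoreItems(keys,bfi,totals=TRUE)   #sum score per scale instead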

The printed output includes coefficients α and G6*, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.

First, create the keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copying it into R). Although not normally necessary, show the keys to understand what is happening.

Note that the number of items to specify in the make.keys function is the total number of items in the inventory. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.

> keys <- make.keys(nvars=28,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+    Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+    Openness = c(21,-22,23,24,-25)),
+    item.labels=colnames(bfi))
> keys

          Agree Conscientious Extraversion Neuroticism Openness
A1           -1             0            0           0        0
A2            1             0            0           0        0
A3            1             0            0           0        0
A4            1             0            0           0        0
A5            1             0            0           0        0
C1            0             1            0           0        0
C2            0             1            0           0        0
C3            0             1            0           0        0
C4            0            -1            0           0        0
C5            0            -1            0           0        0
E1            0             0           -1           0        0
E2            0             0           -1           0        0
E3            0             0            1           0        0
E4            0             0            1           0        0
E5            0             0            1           0        0
N1            0             0            0           1        0
N2            0             0            0           1        0
N3            0             0            0           1        0
N4            0             0            0           1        0
N5            0             0            0           1        0
O1            0             0            0           0        1
O2            0             0            0           0       -1
O3            0             0            0           0        1
O4            0             0            0           0        1
O5            0             0            0           0       -1
gender        0             0            0           0        0
education     0             0            0           0        0
age           0             0            0           0        0

The use of multiple key matrices for different inventories is facilitated by using the superMatrix function to combine two or more matrices. This allows convenient scoring of large data sets combining multiple inventories with keys based upon each individual inventory. Pretend for the moment that the big 5 items were made up of two inventories, one consisting of the first 10 items, the second the last 18 items (15 personality items + 3 demographic items). Then the following code would work:

> keys.1 <- make.keys(10,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10)))
> keys.2 <- make.keys(15,list(Extraversion=c(-1,-2,3:5),Neuroticism=c(6:10),
+    Openness = c(11,-12,13,14,-15)))
> keys.25 <- superMatrix(list(keys.1,keys.2))

The resulting keys matrix is identical to that found above, except that it does not include the extra 3 demographic items. This is useful when scoring the raw items, because the response frequencies for each category are reported, and for the demographic data.

This use of making multiple key matrices, and then combining them into one super matrix of keys, is particularly useful when combining demographic information with items to be scored. A set of demographic keys can be made and then these can be combined with the keys for the particular scales.

Now use these keys in combination with the raw data to score the items, calculate basic reliability and intercorrelations, and find the item-by-scale correlations for each item and each scale. By default, missing data are replaced by the median for that variable.

> scores <- scoreItems(keys,bfi)
> scores

Call: scoreItems(keys = keys, items = bfi)

(Unstandardized) Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
alpha   0.7          0.72         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
ASE   0.014         0.014        0.013       0.011    0.017

Average item correlation:
          Agree Conscientious Extraversion Neuroticism Openness
average.r  0.32          0.34         0.39        0.46     0.23

 Guttman 6* reliability:
         Agree Conscientious Extraversion Neuroticism Openness
Lambda.6   0.7          0.72         0.76        0.81      0.6

Signal/Noise based upon av.r :
             Agree Conscientious Extraversion Neuroticism Openness
Signal/Noise   2.3           2.6          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, alpha on the diagonal
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.70          0.36         0.63      -0.245     0.23
Conscientious  0.26          0.72         0.35      -0.305     0.30
Extraversion   0.46          0.26         0.76      -0.284     0.32
Neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
Openness       0.15          0.19         0.22      -0.086     0.60

In order to see the item by scale loadings and frequency counts of the data, print with the short option = FALSE.

To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores (see Figure 16).
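For instance (a hedged sketch; the element names come from the scoreItems documentation rather than from output shown in this document):

> names(scores)                          #list the elements of the scoreItems object
> headTail(scores$scores)                #the individual scale scores
> pairs.panels(scores$scores,pch='.')    #visualize the scale intercorrelations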

6.4.2 Forming scales from a correlation matrix

There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.

Consider the same bfi data set, but first find the correlations, and then use cluster.cor.

> r.bfi <- cor(bfi,use="pairwise")
> scales <- cluster.cor(keys,r.bfi)
> summary(scales)

Call: cluster.cor(keys = keys, r.mat = r.bfi)

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, (standardized) alpha on the diagonal
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.71          0.35         0.64      -0.242     0.25
Conscientious  0.25          0.73         0.36      -0.287     0.30
Extraversion   0.47          0.27         0.76      -0.278     0.35
Neuroticism   -0.18         -0.22        -0.22       0.815    -0.11
Openness       0.16          0.20         0.24      -0.074     0.61

To find the correlations of the items with each of the scales (the "structure" matrix) or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.


> png('scores.png')
> pairs.panels(scores$scores,pch='.',jiggle=TRUE)
> dev.off()
quartz
     2

Figure 16: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.


6.5 Scoring Multiple Choice Items

Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).

> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys,iqitems)

Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

item statistics
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39

> #just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys,iqitems,score=FALSE)
> describe(iq.tf)  #compare to previous results

          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01

Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
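A hedged sketch of that step, treating all 16 items as one composite (the scale name and keys here are illustrative, not from the original):

> iq.key <- make.keys(16,list(iq=1:16),item.labels=colnames(iq.tf))
> iq.scale <- scoreItems(iq.key,iq.tf)   #alpha, G6 and scores for the composite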

6.6 Item analysis

Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or based upon correlation matrices using the setCor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.
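A compact sketch of that workflow for a single hypothesized dimension (the choice of the five neuroticism items is merely illustrative):

> describe(bfi[16:20])      #step 1: item distributions
> fa.parallel(bfi[16:20])   #step 2: how many dimensions?
> alpha(bfi[16:20])         #step 3: item-whole statistics for a one dimensional scale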

Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 17).

7 Item Response Theory analysis

The use of Item Response Theory has become so widespread that it is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 18).


> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys,iqitems,score=TRUE,short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  #set this to see the output for multiple items
> irt.responses(scores$scores,iqitems[1:4],breaks=11)

Figure 17: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.


7.1 Factor analysis and Item Response Theory

If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δj, and item discrimination, αj, may be found from factor analysis of the tetrachoric correlations where λj is just the factor loading on the first factor and τj is the normal threshold reported by the tetrachoric function:

$$\delta_j = \frac{D\tau_j}{\sqrt{1-\lambda_j^2}}, \qquad \alpha_j = \frac{\lambda_j}{\sqrt{1-\lambda_j^2}} \tag{2}$$

where D is a scaling factor used when converting to the parameterization of the logistic model and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λj) and item thresholds (τj) are just:

$$\lambda_j = \frac{\alpha_j}{\sqrt{1+\alpha_j^2}}, \qquad \tau_j = \frac{\delta_j}{\sqrt{1+\alpha_j^2}}$$
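A quick numeric round trip through these conversions (the loading and threshold values are made up for illustration, with D = 1, i.e., the normal metric):

> lambda <- 0.6; tau <- 0.5
> a <- lambda/sqrt(1 - lambda^2)   #discrimination alpha = 0.75
> d <- tau/sqrt(1 - lambda^2)      #location delta = 0.625
> a/sqrt(1 + a^2)                  #recovers lambda = 0.6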

Consider 9 dichotomous items representing one factor but differing in their levels of difficulty.

> set.seed(17)
> d9 <- sim.irt(9,1000,-2.5,2.5,mod="normal") #dichotomous items
> test <- irt.fa(d9$items)

> test

Item Response Analysis using Factor Analysis

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis

 Summary information by factor and item
 Factor =  1
              -3    -2    -1     0    1    2     3
V1          0.48  0.59  0.24  0.06 0.01 0.00  0.00
V2          0.30  0.68  0.45  0.12 0.02 0.00  0.00
V3          0.12  0.50  0.74  0.29 0.06 0.01  0.00
V4          0.05  0.26  0.74  0.55 0.14 0.03  0.00
V5          0.01  0.07  0.44  1.05 0.41 0.06  0.01
V6          0.00  0.03  0.15  0.60 0.74 0.24  0.04
V7          0.00  0.01  0.04  0.22 0.73 0.63  0.16
V8          0.00  0.00  0.02  0.12 0.45 0.69  0.31
V9          0.00  0.01  0.02  0.08 0.25 0.47  0.36
Test Info   0.98  2.14  2.85  3.09 2.81 2.14  0.89
SEM         1.01  0.68  0.59  0.57 0.60 0.68  1.06
Reliability -0.02 0.53  0.65  0.68 0.64 0.53 -0.12


Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27  and the objective function was  1.2
The number of observations was  1000  with Chi Square =  1195.58  with prob <  7.3e-235

The root mean square of the residuals (RMSA) is  0.09
The df corrected root mean square of the residuals is  0.1

Tucker Lewis Index of factoring reliability =  0.699
RMSEA index =  0.209  and the 90 % confidence intervals are  0.198 0.218
BIC =  1009.07

Similar analyses can be done for polytomous items, such as those of the bfi extraversion scale:

> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis

 Summary information by factor and item
 Factor =  1
              -3    -2    -1     0    1    2    3
E1          0.37  0.58  0.75  0.76 0.61 0.39 0.21
E2          0.48  0.89  1.18  1.24 0.99 0.59 0.25
E3          0.34  0.49  0.58  0.57 0.49 0.36 0.23
E4          0.63  0.98  1.11  0.96 0.67 0.37 0.16
E5          0.33  0.42  0.47  0.45 0.37 0.27 0.17
Test Info   2.15  3.36  4.09  3.98 3.13 1.97 1.02
SEM         0.68  0.55  0.49  0.50 0.56 0.71 0.99
Reliability 0.53  0.70  0.76  0.75 0.68 0.49 0.02

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5  and the objective function was  0.05
The number of observations was  2800  with Chi Square =  135.92  with prob <  1.3e-27

The root mean square of the residuals (RMSA) is  0.04
The df corrected root mean square of the residuals is  0.06

Tucker Lewis Index of factoring reliability =  0.932
RMSEA index =  0.097  and the 90 % confidence intervals are  0.083 0.111
BIC =  96.23

The item information functions show that not all of the items are equally good (Figure 19):

These procedures can be generalized to more than one factor by specifying the number of


> op <- par(mfrow=c(3,1))
> plot(test,type="ICC")
> plot(test,type="IIC")
> plot(test,type="test")
> op <- par(mfrow=c(1,1))

Figure 18: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.


> e.info <- plot(e.irt,type="IIC")

Figure 19: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.


factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.

> print(e.info,sort=TRUE)

Item Response Analysis using Factor Analysis

 Summary information by factor and item
 Factor =  1
              -3    -2    -1     0    1    2    3
E2          0.48  0.89  1.18  1.24 0.99 0.59 0.25
E4          0.63  0.98  1.11  0.96 0.67 0.37 0.16
E1          0.37  0.58  0.75  0.76 0.61 0.39 0.21
E3          0.34  0.49  0.58  0.57 0.49 0.36 0.23
E5          0.33  0.42  0.47  0.45 0.37 0.27 0.17
Test Info   2.15  3.36  4.09  3.98 3.13 1.97 1.02
SEM         0.68  0.55  0.49  0.50 0.56 0.71 0.99
Reliability 0.53  0.70  0.76  0.75 0.68 0.49 0.02

More extensive IRT packages include ltm and eRm; these should be used for serious Item Response Theory analysis.

7.2 Speeding up analyses

Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.

For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 20) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items (Figure 21).

> iq.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = iq.tf)
Item Response Analysis using Factor Analysis

 Summary information by factor and item
 Factor =  1
            -3    -2    -1     0    1    2    3
reason.4  0.04  0.20  0.59  0.59 0.19 0.04 0.01
reason.16 0.07  0.23  0.46  0.39 0.16 0.05 0.01
reason.17 0.06  0.27  0.69  0.50 0.14 0.03 0.01
reason.19 0.05  0.17  0.41  0.46 0.22 0.07 0.02

> iq.irt <- irt.fa(iq.tf)

Figure 20: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.


letter.7    0.04  0.16  0.44  0.52 0.23 0.06 0.02
letter.33   0.04  0.14  0.34  0.43 0.24 0.08 0.02
letter.34   0.04  0.17  0.48  0.54 0.22 0.06 0.01
letter.58   0.02  0.08  0.28  0.55 0.39 0.13 0.03
matrix.45   0.05  0.11  0.23  0.29 0.21 0.10 0.04
matrix.46   0.05  0.12  0.25  0.30 0.21 0.10 0.04
matrix.47   0.05  0.16  0.37  0.41 0.21 0.07 0.02
matrix.55   0.04  0.08  0.14  0.20 0.19 0.13 0.06
rotate.3    0.00  0.01  0.07  0.29 0.66 0.45 0.13
rotate.4    0.00  0.01  0.05  0.30 0.91 0.55 0.11
rotate.6    0.01  0.03  0.14  0.49 0.68 0.27 0.06
rotate.8    0.00  0.02  0.08  0.27 0.54 0.39 0.13
Test Info   0.55  1.95  4.99  6.53 5.41 2.57 0.71
SEM         1.34  0.72  0.45  0.39 0.43 0.62 1.18
Reliability -0.80 0.49  0.80  0.85 0.82 0.61 -0.40

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104  and the objective function was  1.85
The number of observations was  1525  with Chi Square =  2801.05  with prob <  0

The root mean square of the residuals (RMSA) is  0.08
The df corrected root mean square of the residuals is  0.09

Tucker Lewis Index of factoring reliability =  0.742
RMSEA index =  0.131  and the 90 % confidence intervals are  0.126 0.135
BIC =  2038.76

7.3 IRT based scoring

The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for score.irt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items per person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The score.irt function applies basic IRT to this problem.

Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects:

> v9 <- sim.irt(9,1000,-2.,2.,mod="normal") #dichotomous items
> items <- v9$items
> test <- irt.fa(items)


> om <- omega(iq.irt$rho,4)

Figure 21: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when examining the item response functions.


> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> #now delete some of the data - note that they are ordered by score
> items[1:333,5:9] <- NA
> items[334:666,3:7] <- NA
> items[667:1000,1:4] <- NA
> scores <- score.irt(test,items)
> unitweighted <- score.irt(items=items,keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord],scores,unitweighted)
> colnames(scores.df) <- c("True theta","irt theta","total","fit","rasch","total","fit")

These results are seen in Figure 22

8 Multilevel modeling

Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (over all descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.

8.1 Decomposing data into within and between level correlations using statsBy

There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.

This follows the decomposition of an observed correlation into the pooled correlation within groups ($r_{wg}$) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:

$r_{xy} = \eta_{x_{wg}} \cdot \eta_{y_{wg}} \cdot r_{xy_{wg}} + \eta_{x_{bg}} \cdot \eta_{y_{bg}} \cdot r_{xy_{bg}}$  (3)


Figure 22: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.
> pairs.panels(scores.df, pch = ".", gap = 0)
> title("Comparing true theta for IRT, Rasch and classically based scoring", line = 3)

[SPLOM omitted: pairwise scatter plots and correlations among True theta, irt theta, total, fit, rasch, total, and fit; True theta correlates about .78 with the IRT based score and about .40 with the simple total score. Main title: "Comparing true theta for IRT, Rasch and classically based scoring".]


where $r_{xy}$ is the normal correlation, which may be decomposed into within group and between group correlations $r_{xy_{wg}}$ and $r_{xy_{bg}}$, and $\eta$ (eta) is the correlation of the data with the within group values or the group means.

8.2 Generating and displaying multilevel data

withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5 and V8, and V3, V6 and V9. V1 has a within group correlation of 0 with V2, V5 and V8, and a -1 within group correlation with V3, V6 and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5 and V6, and V7, V8 and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.
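A minimal sketch of displaying and decomposing these data (this assumes, as the help file describes, that the first column of withinBetween codes the group and the remaining nine columns hold V1-V9; the $rwg and $rbg element names follow the statsBy documentation):

R code
library(psych)
data(withinBetween)
lowerMat(cor(withinBetween[2:10]))       # raw correlations mix the two levels
sb.wb <- statsBy(withinBetween, "Group") # separate the two levels
round(sb.wb$rwg, 2)                      # pooled within group correlations
round(sb.wb$rbg, 2)                      # correlations of the group means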

sim.multilevel will generate simulated data with a multilevel structure.

The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.

Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)) or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
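For example, a sketch of the first case (again, the $rwg and $rbg element names follow the statsBy documentation):

R code
data(sat.act)
sb <- statsBy(sat.act, "education")  # group the data by level of education
round(sb$rwg, 2)                     # pooled within group correlations
round(sb$rbg, 2)                     # between group correlations of the group means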

9 Set Correlation and Multiple Regression from the correlation matrix

An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is

$R^2 = 1 - \prod_{i=1}^{n} (1 - \lambda_i)$


where $\lambda_i$ is the ith eigenvalue of the eigenvalue decomposition of the matrix

$R = R_{xx}^{-1} R_{xy} R_{yy}^{-1} R_{yx}$

Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic based upon the average canonical correlation might be more appropriate.

setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.

Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example, the analysis is done on the correlation matrix rather than the raw data.

> C <- cov(sat.act, use = "pairwise")
> model1 <- lm(ACT ~ gender + education + age, data = sat.act)
> summary(model1)

Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared: 0.0272,	Adjusted R-squared: 0.02301
F-statistic: 6.487 on 3 and 696 DF, p-value: 0.0002476

Compare this with the output from setCor:

> # compare with mat.regress
> setCor(c(4:6), c(1:3), C, n.obs = 700)


Call: setCor(y = y, x = x, data = data, z = z, n.obs = n.obs, use = use,
    std = std, square = square, main = main, plot = plot)

Multiple Regression from matrix input

Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19

multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11

Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05

degrees of freedom of regression
[1]   3 696

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.050 0.033 0.008
Chisq of canonical correlations
[1] 35.8 23.1  5.6

 Average squared canonical correlation =  0.03
 Cohen's Set Correlation R2  =  0.09
 Shrunken Set Correlation R2  =  0.08
 F and df of Cohen's Set Correlation   7.26 9 1681.86
Unweighted correlation between the two sets =  0.01

Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same independent of the direction of the relationship.
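This symmetry can be checked by exchanging the two sets; a quick sketch reusing the covariance matrix C computed above:

R code
setCor(c(1:3), c(4:6), C, n.obs = 700)  # demographics predicted from the ability tests
# Cohen's Set Correlation R2 at the end of the output should again be 0.09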

10 Simulation functions

It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.

A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex, sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated Rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.

sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.simple Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.

sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.

Some of these functions are described in more detail in the companion vignette, psych for sem.

The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.

Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and $r^n$ for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.

Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.
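A minimal sketch of this challenge (sim.minor's argument names follow its help page; that the simulated raw data are returned in the $observed element when n > 0 is an assumption to verify in your version of psych):

R code
set.seed(17)
minor <- sim.minor(nvar = 12, nfact = 3, n = 500)  # 3 major and 6 minor factors
fa.parallel(minor$observed)  # does parallel analysis report just the 3 major factors?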

Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple, independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.


The four IRT simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.

The logistic model is

$P(x | \theta_i, \delta_j, \gamma_j, \zeta_j) = \gamma_j + \dfrac{\zeta_j - \gamma_j}{1 + e^{\alpha_j(\delta_j - \theta_i)}}$  (4)

where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), $\alpha_j$ is item discrimination and $\delta_j$ is item difficulty. For the 1 Parameter Logistic (Rasch) model, γ = 0, ζ = 1, α = 1, and item difficulty is the only free parameter to specify.

(Graphics of these may be seen in the demonstrations for the logistic function)

The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same. With the a = α parameter = 1.702 in the logistic model, the two models are practically identical.
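This near-identity is easy to verify in base R; the following sketch implements Equation 4 and its normal-ogive counterpart directly (the function names here are illustrative, not psych functions):

R code
logistic4pl <- function(theta, delta, alpha = 1, gamma = 0, zeta = 1)
    gamma + (zeta - gamma) / (1 + exp(alpha * (delta - theta)))
normal.ogive <- function(theta, delta, alpha = 1) pnorm(alpha * (theta - delta))
theta <- seq(-3, 3, 0.01)
max(abs(logistic4pl(theta, 0, alpha = 1.702) - normal.ogive(theta, 0)))  # less than 0.01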

11 Graphical Displays

Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as specific diagram functions, e.g. (but not shown):

f3 <- fa(Thurstone, 3)
plot(f3)
fa.diagram(f3)
c <- iclust(Thurstone)
plot(c)             # a pretty boring plot
iclust.diagram(c)   # a better diagram
c3 <- iclust(Thurstone, 3)
plot(c3)            # a more interesting plot
data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)
ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)

The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in the dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.


> xlim = c(0, 10)
> ylim = c(0, 10)
> plot(NA, xlim = xlim, ylim = ylim, main = "Demonstration of dia functions", axes = FALSE, xlab = "", ylab = "")
> ul <- dia.rect(1, 9, labels = "upper left", xlim = xlim, ylim = ylim)
> ll <- dia.rect(1, 3, labels = "lower left", xlim = xlim, ylim = ylim)
> lr <- dia.ellipse(9, 3, "lower right", xlim = xlim, ylim = ylim)
> ur <- dia.ellipse(9, 9, "upper right", xlim = xlim, ylim = ylim)
> ml <- dia.ellipse(3, 6, "middle left", xlim = xlim, ylim = ylim)
> mr <- dia.ellipse(7, 6, "middle right", xlim = xlim, ylim = ylim)
> bl <- dia.ellipse(1, 1, "bottom left", xlim = xlim, ylim = ylim)
> br <- dia.rect(9, 1, "bottom right", xlim = xlim, ylim = ylim)
> dia.arrow(from = lr, to = ul, labels = "right to left")
> dia.arrow(from = ul, to = ur, labels = "left to right")
> dia.curved.arrow(from = lr, to = ll$right, labels = "right to left")
> dia.curved.arrow(to = ur, from = ul$right, labels = "left to right")
> dia.curve(ll$top, ul$bottom, "double")  # for rectangles, specify where to point
> dia.curved.arrow(mr, ur, "up")          # but for ellipses, just point to it
> dia.curve(ml, mr, "across")
> dia.arrow(ur, lr, "top down")
> dia.curved.arrow(br$top, lr$bottom, "up")
> dia.curved.arrow(bl, br, "left to right")
> dia.arrow(bl, ll$bottom)
> dia.curved.arrow(ml, ll$right)
> dia.curved.arrow(mr, lr$top)


Figure 23: The basic graphic capabilities of the dia functions are shown in this figure.

12 Miscellaneous functions

A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.

block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score.

geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headTail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headTail. Combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r, and estimates effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also setCor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.

superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys or when forming example problems (see the sketch below).
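A minimal illustration of superMatrix (the keying matrices here are arbitrary examples):

R code
keys.A <- matrix(c(1, 1, 1, -1), ncol = 1)  # a 4 item, 1 scale keys matrix
keys.B <- matrix(c(1, 1, 1), ncol = 1)      # a 3 item, 1 scale keys matrix
superMatrix(keys.A, keys.B)                 # a 7 x 2 matrix with 0s in the off diagonal blocks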

13 Data sets

A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941) and were reanalyzed by Bechtoldt (1961). Personality item data represent five personality factors on 25 items (bfi) or 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.

Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtholdt data sets are both 17 x 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.


iq 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton height. peas is the data set Francis Galton used to introduce the correlation coefficient with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1994), this data set allows for examples of basic scaling techniques.

14 Development version and a users guide

The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered ones). To download that version, go to the repository http://personality-project.org/r/src/contrib and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf

News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news:

> news(Version > "1.2.8", package = "psych")


15 Psychometric Theory

The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book), An introduction to Psychometric Theory with Applications in R (Revelle, prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular and R in general, consult http://personality-project.org/r/r.guide.html, A short guide to R.

16 SessionInfo

This document was prepared using the following settings

> sessionInfo()

R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] psych_1.6.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4           lattice_0.20-33       matrixcalc_1.0-3      MASS_7.3-45           grid_3.3.0            arm_1.8-6
 [7] nlme_3.1-127          stats4_3.3.0          coda_0.18-1           minqa_1.2.4           mi_1.0                nloptr_1.0.4
[13] Matrix_1.2-6          boot_1.3-18           splines_3.3.0         lme4_1.1-12           sem_3.1-7             tools_3.3.0
[19] abind_1.4-3           GPArotation_2014.11-1 parallel_3.3.0        mnormt_1.5-4


References

Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405-432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439-458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447-473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel Modeling in R (2.3): A Brief Introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245-276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78-98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297-334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173-178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England. 122 pp.

Fox, J., Nie, Z., and Byrnes, J. (2013). sem: Structural Equation Models. R package version 3.1-3.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430-450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255-282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121-132.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41-54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179-185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283-300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4):537-549.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231-258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309-317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153-175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676-1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481-495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill, New York, 3rd edition.

Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57-74.

Revelle, W. (2017). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. http://cran.r-project.org/web/packages/psych. R package version 1.7.3.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39-73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403-414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Individual differences in cognition: New methods for examining the personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27-49. Springer, New York, NY.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega and the glb: comments on Sijtsma. Psychometrika, 74(1):145-154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83-90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420-428.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72-101.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345-353.

Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components - an alternative to "mathematical factors". Psychological Review, 42(5):425-454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321-327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123-133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121-144.



2 Overview of this and related documents

The psych package (Revelle, 2017) has been developed at Northwestern University since 2005 to include functions most useful for personality, psychometric, and psychological research. The package is also meant to supplement a text on psychometric theory (Revelle, prep), a draft of which is available at http://personality-project.org/r/book/.

Some of the functions (e.g., read.file, read.clipboard, describe, pairs.panels, scatter.hist, error.bars, multi.hist, bi.bars) are useful for basic data entry and descriptive analyses.

Psychometric applications emphasize techniques for dimension reduction including factor analysis, cluster analysis, and principal components analysis. The fa function includes five methods of factor analysis (minimum residual, principal axis, weighted least squares, generalized least squares and maximum likelihood factor analysis). Determining the number of factors or components to extract may be done by using the Very Simple Structure (Revelle and Rocklin, 1979) (vss), Minimum Average Partial correlation (Velicer, 1976) (MAP) or parallel analysis (fa.parallel) criteria. Item Response Theory (IRT) models for dichotomous or polytomous items may be found by factoring tetrachoric or polychoric correlation matrices and expressing the resulting parameters in terms of location and discrimination using irt.fa. Bifactor and hierarchical factor structures may be estimated by using Schmid Leiman transformations (Schmid and Leiman, 1957) (schmid) to transform a hierarchical factor structure into a bifactor solution (Holzinger and Swineford, 1937). Scale construction can be done using the Item Cluster Analysis (Revelle, 1979) (iclust) function to determine the structure and to calculate reliability coefficients α (Cronbach, 1951) (alpha, scoreItems, score.multiple.choice), β (Revelle, 1979; Revelle and Zinbarg, 2009) (iclust) and McDonald's ωh and ωt (McDonald, 1999) (omega). Guttman's six estimates of internal consistency reliability (Guttman, 1945), as well as additional estimates (Revelle and Zinbarg, 2009), are in the guttman function. The six measures of Intraclass correlation coefficients (ICC) discussed by Shrout and Fleiss (1979) are also available.

Graphical displays include Scatter Plot Matrix (SPLOM) plots using pairs.panels, correlation "heat maps" (corPlot), factor, cluster, and structural diagrams using fa.diagram, iclust.diagram, structure.diagram, as well as item response characteristics and item and test information characteristic curves plot.irt and plot.poly.

3 Getting started

Some of the functions described in this overview require other packages. Particularly useful for rotating the results of factor analyses (from, e.g., fa or principal) or hierarchical factor models using omega or schmid is the GPArotation package. These and other useful packages may be installed by first installing and then using the task views (ctv) package to install the "Psychometrics" task view, but doing it this way is not necessary.
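For example, a one time setup might be (this assumes an internet connection; install.views comes from the ctv package):

R code
install.packages("ctv")
library(ctv)
install.views("Psychometrics")  # installs psych, GPArotation, and related packages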

4 Basic data analysis

A number of psych functions facilitate the entry of data and finding basic descriptive statistics.

Remember, to run any of the psych functions, it is necessary to make the package active by using the library command:

R code
library(psych)

The other packages, once installed, will be called automatically by psych.

It is possible to automatically load psych and other functions by creating and then saving a ".First" function, e.g.,

R code
.First <- function(x) {library(psych)}

4.1 Getting the data by using read.file

Although many find it convenient to copy the data to the clipboard and then use the read.clipboard functions (see below), a helpful alternative is to read the data in directly. This can be done using the read.file function, which calls file.choose to find the file and then, based upon the suffix of the file, chooses the appropriate way to read it. For files with suffixes of .txt, .text, .r, .rds, .rda, .csv, .xpt, or .sav, the file will be read correctly.

R code
my.data <- read.file()

If the file contains Fixed Width Format (fwf) data, the column information can be specified with the widths command.

R code
my.data <- read.file(widths = c(4, rep(1, 35)))  # will read in a file without a header row and 36 fields, the first of which is 4 columns, the rest of which are 1 column each

If the file is a RData file (with suffix of .RData, .Rda, .rda, .Rdata, or .rdata) the object will be loaded. Depending what was stored, this might be several objects. If the file is a sav file from SPSS, it will be read with the most useful default options (converting the file to a data.frame and converting character fields to numeric). Alternative options may be specified. If it is an export file from SAS (xpt or XPT) it will be read. csv files (comma separated files), normal txt or text files, data or dat files will be read as well. These are assumed to have a header row of variable labels (header=TRUE). If the data do not have a header row, you must specify read.file(header=FALSE).

To read SPSS files and to keep the value labels, specify use.value.labels=TRUE. Unfortunately, this will make it difficult to do any analyses, and you might want to read it again, converting the value labels to numeric values.

R code
my.spss <- read.file(use.value.labels = TRUE)  # this will keep the value labels for sav files
my.data <- read.file()  # just read the file and convert the value labels to numeric

You can quickView the my.spss object to see what it looks like, as well as the my.data object to see the numeric encoding.

R code
quickView(my.spss)  # to see the original encoding
quickView(my.data)  # to see the numeric encoding

4.2 Data input from the clipboard

There are of course many ways to enter data into R. Reading from a local file using read.file is perhaps the most preferred. However, many users will enter their data in a text editor or spreadsheet program and then want to copy and paste into R. This may be done by using read.table and specifying the input file as "clipboard" (PCs) or "pipe(pbpaste)" (Macs). Alternatively, the read.clipboard set of functions are perhaps more user friendly:

read.clipboard is the base function for reading data from the clipboard.

read.clipboard.csv for reading text that is comma delimited.

read.clipboard.tab for reading text that is tab delimited (e.g., copied directly from an Excel file).

read.clipboard.lower for reading input of a lower triangular matrix with or without a diagonal. The resulting object is a square matrix.

read.clipboard.upper for reading input of an upper triangular matrix.

read.clipboard.fwf for reading in fixed width fields (some very old data sets).

For example, given a data set copied to the clipboard from a spreadsheet, just enter the command:

R code
my.data <- read.clipboard()

This will work if every data field has a value and even missing data are given some values (e.g., NA or -999). If the data were entered in a spreadsheet and the missing values were just empty cells, then the data should be read in as tab delimited or by using the read.clipboard.tab function.

R code
my.data <- read.clipboard(sep = "\t")   # define the tab option, or
my.tab.data <- read.clipboard.tab()     # just use the alternative function

For the case of data in fixed width fields (some old data sets tend to have this format), copy to the clipboard and then specify the width of each field (in the example below, the first variable is 5 columns, the second is 2 columns, the next 5 are 1 column, the last 4 are 3 columns).

R code
my.data <- read.clipboard.fwf(widths = c(5, 2, rep(1, 5), rep(3, 4)))

4.3 Basic descriptive statistics

Once the data are read in, then describe will provide basic descriptive statistics arranged in a data frame format. Consider the data set sat.act, which includes data from 700 web based participants on 3 demographic variables and 3 ability measures.

describe reports means, standard deviations, medians, min, max, range, skew, kurtosis and standard errors for integer or real data. Non-numeric data, although the statistics are meaningless, will be treated as if numeric (based upon the categorical coding of the data) and will be flagged with an *.

It is very important to describe your data before you continue on doing more complicated multivariate statistics. The problem of outliers and bad data can not be overemphasized.

> library(psych)
> data(sat.act)
> describe(sat.act)  # basic descriptive statistics

          vars   n   mean     sd median trimmed    mad min max range  skew kurtosis   se
gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61    -1.62 0.02
education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68    -0.07 0.05
age          3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64     2.42 0.36
ACT          4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66     0.53 0.18
SATV         5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64     0.33 4.27
SATQ         6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59    -0.02 4.41

4.4 Simple descriptive graphics

Graphic descriptions of data are very helpful both for understanding the data as well as communicating important results. Scatter Plot Matrices (SPLOMS) using the pairs.panels function are useful ways to look for strange effects involving outliers and non-linearities. error.bars.by will show group means with 95% confidence boundaries.
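For example (a sketch following the pattern of the error.bars.by help page):

R code
data(sat.act)
error.bars.by(sat.act[5:6], sat.act$gender, bars = TRUE,
              labels = c("male", "female"), main = "SAT V and SAT Q by gender")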

4.4.1 Scatter Plot Matrices

Scatter Plot Matrices (SPLOMS) are very useful for describing the data. The pairs.panels function, adapted from the help menu for the pairs function, produces xy scatter plots of each pair of variables below the diagonal, shows the histogram of each variable on the diagonal, and shows the lowess locally fit regression line as well. An ellipse around the mean, with the axis length reflecting one standard deviation of the x and y variables, is also drawn. The x axis in each scatter plot represents the column variable, the y axis the row variable (Figure 1). When plotting many subjects, it is both faster and cleaner to set the plot character (pch) to be '.'. (See Figure 1 for an example.)

pairs.panels will show the pairwise scatter plots of all the variables as well as histograms, locally smoothed regressions, and the Pearson correlation. When plotting many data points (as in the case of the sat.act data), it is possible to specify that the plot character is a period to get a somewhat cleaner graphic.

4.4.2 Correlational structure

There are many ways to display correlations. Tabular displays are probably the most common. The output from the cor function in core R is a rectangular matrix. lowerMat will round this to (2) digits and then display it as a lower off diagonal matrix. lowerCor calls cor with use='pairwise', method='pearson' as default values, returns (invisibly) the full correlation matrix, and displays the lower off diagonal matrix.

> lowerCor(sat.act)

          gendr edctn age   ACT   SATV  SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64  1.00

When comparing results from two different groups, it is convenient to display them as one matrix, with the results from one group below the diagonal and the other group above the diagonal. Use lowerUpper to do this:

> female <- subset(sat.act, sat.act$gender == 2)
> male <- subset(sat.act, sat.act$gender == 1)
> lower <- lowerCor(male[-1])


> png('pairspanels.png')
> pairs.panels(sat.act, pch = ".")
> dev.off()
null device
          1

Figure 1: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. The plot character was set to a period (pch='.') in order to make a cleaner graph.


          edctn age   ACT   SATV  SATQ
education  1.00
age        0.61  1.00
ACT        0.16  0.15  1.00
SATV       0.02 -0.06  0.61  1.00
SATQ       0.08  0.04  0.60  0.68  1.00

> upper <- lowerCor(female[-1])

          edctn age   ACT   SATV  SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08  1.00
SATV       0.07 -0.03  0.53  1.00
SATQ       0.03 -0.09  0.58  0.63  1.00

> both <- lowerUpper(lower, upper)
> round(both, 2)

          education   age  ACT  SATV  SATQ
education        NA  0.52 0.16  0.07  0.03
age            0.61    NA 0.08 -0.03 -0.09
ACT            0.16  0.15   NA  0.53  0.58
SATV           0.02 -0.06 0.61    NA  0.63
SATQ           0.08  0.04 0.60  0.68    NA

It is also possible to compare two matrices by taking their differences and displaying one (below the diagonal) and the difference of the second from the first above the diagonal:

> diffs <- lowerUpper(lower, upper, diff = TRUE)
> round(diffs, 2)

          education   age  ACT  SATV SATQ
education        NA  0.09 0.00 -0.05 0.05
age            0.61    NA 0.07 -0.03 0.13
ACT            0.16  0.15   NA  0.08 0.02
SATV           0.02 -0.06 0.61    NA 0.05
SATQ           0.08  0.04 0.60  0.68   NA

4.4.3 Heatmap displays of correlational structure

Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set, which has a clear 3 factor solution (Figure 2), or a simulated data set of 24 variables with a circumplex structure (Figure 3). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.


> png('corplot.png')
> corPlot(Thurstone, numbers = TRUE, main = "9 cognitive variables from Thurstone")
> dev.off()
null device
          1

Figure 2: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option, the values are displayed as well.


> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> corPlot(r.circ, main = "24 variables in a circumplex")
> dev.off()
null device
          1

Figure 3: Using the corPlot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect.


4.5 Polychoric, tetrachoric, polyserial, and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, the φ will underestimate the value of the Pearson applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process. A simple generalization of this to the case of multiple cuts is the polychoric correlation.

Other estimated correlations based upon the assumption of bivariate normality with cutpoints include the biserial and polyserial correlation

If the data are a mix of continuous, polytomous and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.

A correlation matrix formed from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will also happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigenvalues of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
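A sketch of detecting and repairing the problem (the burt correlation matrix ships with psych and is, as noted, not quite positive definite):

R code
data(burt)
round(eigen(burt)$values, 3)    # the smallest eigenvalue is (slightly) negative
burt.s <- cor.smooth(burt)      # rescale the eigenvalues to be positive
round(eigen(burt.s)$values, 3)  # all positive; the matrix is only slightly changed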

5 Item and scale analysis

The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.

5.1 Dimension reduction through factor analysis and cluster analysis

Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity¹. The goal for parsimony is seen in psychometrics as an attempt either to describe (components)

¹Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.


or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs². At the most simple, a set of items can be thought to represent a random sample from one underlying domain, or perhaps a small set of domains. The question for the psychometrician is how many domains are represented and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.

At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components, or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores, or to estimate (for factors) factor scores.

Several of the functions in psych address the problem of data reduction:

fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will be eventually phased out.

principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values.

factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the sum of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.

vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ2 estimate.

²Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.

fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.

fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

nfactors A number of different tests for the number of factors problem are run.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm, specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.

5.1.1 Minimum Residual Factor Analysis

The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix $_nR_n$ be approximated by the product of a factor matrix $_nF_k$ and its transpose plus a diagonal matrix of uniqueness?

$R = FF' + U^2$  (1)

The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls" and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.

factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.

A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 1). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ", and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation, although prior versions defaulted to varimax. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).

5.1.2 Principal Axis Factor Analysis

An alternative least squares algorithm, factor.pa (incorporated into fa as an option, fm = "pa"), does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter parameter to 1 produces the SAS solution.
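That is (a sketch; max.iter is the relevant fa argument):

R code
f1.sas <- fa(Thurstone, 3, n.obs = 213, fm = "pa", max.iter = 1)  # one iteration PA, as in SAS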

The solutions from the fa, factor.minres and factor.pa, as well as the principal functions, can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations are varimax and quartimax. Oblique transformations include oblimin, quartimin, and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.

Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 2).

5.1.3 Weighted Least Squares Factor Analysis

Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller


Table 1: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.
> f3t <- fa(Thurstone, 3, n.obs = 213)
> f3t
Factor Analysis using method =  minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix

                  MR1   MR2   MR3   h2   u2 com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.0
First.Letters    0.00  0.86  0.00 0.73 0.27 1.0
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

 With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity =  1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63


Table 2: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.
> f3 <- fa(Thurstone,3,n.obs = 213,fm="pa")
> f3o <- target.rot(f3)
> f3o
Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix
                  PA1   PA2   PA3   h2   u2
Sentences        0.89 -0.03  0.07 0.81 0.19
Vocabulary       0.89  0.07  0.00 0.80 0.20
Sent.Completion  0.83  0.04  0.03 0.70 0.30
First.Letters   -0.02  0.85 -0.01 0.73 0.27
4.Letter.Words  -0.05  0.74  0.09 0.57 0.43
Suffixes         0.17  0.63 -0.09 0.43 0.57
Letter.Series   -0.06 -0.08  0.84 0.69 0.31
Pedigrees        0.33 -0.09  0.48 0.37 0.63
Letter.Group    -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00


overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 3).

The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 4). Factors were transformed obliquely using an oblimin. These solutions may be shown as item by factor plots (Figure 4) or by a structure diagram (Figure 5).

> plot(f3t)

[factor plot matrix: pairwise plots of item loadings on MR1, MR2, and MR3, each axis running from 0.0 to 0.8]

Figure 4: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.

A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but


Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.
> f3w <- fa(Thurstone,3,n.obs = 213,fm="wls")
> print(f3w,cut=0,digits=3)
Factor Analysis using method = wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix
                   WLS1   WLS2   WLS3    h2    u2  com
Sentences         0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary        0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion   0.833  0.034  0.007 0.735 0.265 1.00
First.Letters    -0.002  0.855  0.003 0.731 0.269 1.00
4.Letter.Words   -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes          0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series     0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees         0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group     -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

 With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627


> fa.diagram(f3t)

[structure diagram: Sentences, Vocabulary and Sent.Completion load on MR1 (0.9, 0.9, 0.8); First.Letters, 4.Letter.Words and Suffixes load on MR2 (0.9, 0.7, 0.6); Letter.Series, Letter.Group and Pedigrees load on MR3 (0.8, 0.6, 0.5); the factors intercorrelate 0.5 to 0.6]

Figure 5: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.


that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.

5.1.4 Principal Components analysis (PCA)

An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Psychologists typically use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 4 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. (Factor loadings and PCA loadings converge as the number of items increases: factor loadings may be thought of as the asymptotic principal component loadings as the number of items on the component grows towards infinity.) Also note how the goodness of fit statistics, based upon the residual off diagonal elements, are much worse than for the fa solution.
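The similarity of the two models, and the inflation of the component loadings, can be seen directly by comparing the solutions with factor.congruence (discussed later in this chapter). A brief sketch:

> f3 <- fa(Thurstone, 3, n.obs = 213)         # common factor model
> p3 <- principal(Thurstone, 3, n.obs = 213)  # component model
> factor.congruence(f3, p3)                   # cosines between the factor and component dimensions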

5.1.5 Hierarchical and bi-factor solutions

For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).

Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 5) can themselves be factored. This results in a higher order factor model (Figure 6). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 7). Yet


Table 4: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 2. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.
> p3p <- principal(Thurstone,3,n.obs = 213,rotate="Promax")
> p3p
Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  RC1   RC2   RC3   h2   u2 com
Sentences        0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary       0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion  0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters    0.01  0.84  0.07 0.78 0.22 1.0
4.Letter.Words  -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes         0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series    0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees        0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group    -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

 With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
 with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98


another solution is to use structural equation modeling to directly solve for the general and group factors.

> om.h <- omega(Thurstone,n.obs=213,sl=FALSE)
> op <- par(mfrow=c(1,1))

[omega diagram: the nine Thurstone variables load on group factors F1, F2 and F3, which in turn load 0.8, 0.8 and 0.7 on a higher order factor g]

Figure 6: A higher order factor solution to the Thurstone 9 variable problem.

Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011). Unfortunately, this rotation will frequently achieve a local rather than global optimum, and it is recommended to try multiple original configurations.
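A minimal sketch of requesting this rotation through fa (that the "bifactor" rotation name is passed through to the rotation routines is an assumption to check against the installed version; given the local-optimum problem, re-running from several starting configurations and comparing the fits is prudent):

> f.bi <- fa(Thurstone, 3, n.obs = 213, rotate = "bifactor")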


> om <- omega(Thurstone,n.obs=213)

[bifactor diagram: each of the nine variables loads on the general factor g (0.5 to 0.7) and on one of the residualized group factors F1, F2 or F3 (0.2 to 0.6)]

Figure 7: A bifactor solution to the Thurstone 9 variable problem.


5.1.6 Item Cluster Analysis: iclust

An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) reviews why the iclust algorithm is particularly appropriate for scale construction in marketing.

Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):

1. Find the proximity (e.g. correlation) matrix.

2. Identify the most similar pair of items.

3. Combine this most similar pair of items to form a new variable (cluster).

4. Find the similarity of this cluster to all other items and clusters.

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains, or in iclust if there is a failure to increase reliability coefficients α or β).

6. Purify the solution by reassigning items to the most similar cluster center.

iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori, or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 8).
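A minimal sketch of both stopping rules (nclusters is the iclust argument used later in this section to fix the number of clusters a priori):

> ic  <- iclust(bfi[1:25])                 # stop when neither α nor β increases
> ic5 <- iclust(bfi[1:25], nclusters = 5)  # or specify five clusters a priori
> summary(ic5)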


Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 7). The pattern matrix is available as an object in the results.

> data(bfi)
> ic <- iclust(bfi[1:25])

[ICLUST dendrogram: the 25 items form clusters C20 (α = 0.81, β = 0.63), C16 (α = 0.81, β = 0.76), C15 (α = 0.73, β = 0.67) and C21 (α = 0.41, β = 0.27)]

Figure 8: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.

The previous analysis (Figure 8) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 9). Note that finding the polychoric correlations takes some time the first time, but the next three analyses were done using that correlation matrix (r.poly$rho). When using the console for input, polychoric will report on its progress while working using


Table 5: The summary statistics from an iclust analysis shows three large clusters and a smaller cluster.
> summary(ic)  #show the results
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20    C16   C15   C21
C20  0.80 -0.291  0.40 -0.33
C16 -0.24  0.815 -0.29  0.11
C15  0.30 -0.221  0.73 -0.30
C21 -0.23  0.074 -0.20  0.61


progressBar. polychoric and tetrachoric, when running on OS X, will take advantage of the multiple cores on the machine and run somewhat faster.

Table 6: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.
> data(bfi)
> r.poly <- polychoric(bfi[1:25]) #the ... indicate the progress of the function

A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.

5.2 Confidence intervals using bootstrapping techniques

Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ2 and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 8). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.
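A minimal sketch of both cases (n.iter is the argument named above; that cor = "poly" makes fa work from polychoric correlations is an assumption to check against the installed version's documentation):

> fb  <- fa(bfi[1:10], 2, n.iter = 20)                # bootstrap from Pearson correlations
> fbp <- fa(bfi[1:10], 2, n.iter = 20, cor = "poly")  # bootstrap from polychoric correlations (much slower)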

5.3 Comparing factor/component/cluster solutions

Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of the congruence coefficient, which is just the cosine of the angle between the dimensions:

c_{f_i f_j} = Σ_{k=1}^{n} f_{ik} f_{jk} / √(Σ f_{ik}² Σ f_{jk}²)
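This cosine can be computed directly from the loadings, which makes a useful check on factor.congruence. A sketch (f3t and p3p are the solutions from Tables 1 and 4):

> L1 <- f3t$loadings
> L2 <- p3p$loadings
> round(t(L1) %*% L2 / sqrt(outer(colSums(L1^2), colSums(L2^2))), 2)  # same as factor.congruence(f3t, p3p)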

Consider the case of a four factor solution and four cluster solution to the Big Five problem.

> f4 <- fa(bfi[1:25],4,fm="pa")
> factor.congruence(f4,ic)

      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12


> ic.poly <- iclust(r.poly$rho,title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)

[ICLUST dendrogram: the polychoric solution joins all items into C23 (α = 0.83, β = 0.58), with major subclusters C20 (α = 0.83, β = 0.66), C19 (α = 0.8, β = 0.66), C16 (α = 0.84, β = 0.79), C15 (α = 0.77, β = 0.71) and C21 (α = 0.48, β = 0.35)]

Figure 9: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 8), which was done using Pearson correlations.


> ic.poly <- iclust(r.poly$rho,5,title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)

[ICLUST dendrogram with five clusters: C20, C19, C16 and C15 as in Figure 9, with O4 and the C7 (O5, O2) pair left as a fifth grouping]

Figure 10: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 9), which was done without specifying the number of clusters, and to the next one (Figure 11), which was done by changing the beta criterion.


> ic.poly <- iclust(r.poly$rho,beta.size=3,title="ICLUST beta.size=3")

[ICLUST dendrogram with beta.size = 3: essentially the same structure as Figure 9, ending in C23 (α = 0.83, β = 0.58)]

Figure 11: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 8, 9, 10).


Table 7: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.
> print(ic,cut=.3)
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
     O   P    C20   C16   C15   C21
A1 C20 C20
A2 C20 C20  0.59
A3 C20 C20  0.65
A4 C20 C20  0.43
A5 C20 C20  0.65
C1 C15 C15              0.54
C2 C15 C15              0.62
C3 C15 C15              0.54
C4 C15 C15        0.31 -0.66
C5 C15 C15 -0.30  0.36 -0.59
E1 C20 C20 -0.50
E2 C20 C20 -0.61  0.34
E3 C20 C20  0.59             -0.39
E4 C20 C20  0.66
E5 C20 C20  0.50        0.40 -0.32
N1 C16 C16        0.76
N2 C16 C16        0.75
N3 C16 C16        0.74
N4 C16 C16 -0.34  0.62
N5 C16 C16        0.55
O1 C20 C21                   -0.53
O2 C21 C21                    0.44
O3 C20 C21  0.39             -0.62
O4 C21 C21                   -0.33
O5 C21 C21                    0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit = 0.68   Pattern fit = 0.96   RMSR = 0.05
NULL


Table 8: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.
> fa(bfi[1:10],2,n.iter=20)
Factor Analysis with confidence intervals using method =  fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method = minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

 With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with MLE Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.078 and the 90% confidence intervals are 0.071 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

 Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.46 -0.40 -0.36
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.06 -0.03  0.00  0.74  0.77  0.80
A4  0.10  0.15  0.18  0.41  0.44  0.48
A5 -0.02  0.02  0.07  0.58  0.62  0.66
C1  0.52  0.57  0.60 -0.08 -0.05 -0.02
C2  0.59  0.62  0.66 -0.04 -0.01  0.02
C3  0.50  0.54  0.57 -0.01  0.03  0.08
C4 -0.70 -0.66 -0.62 -0.02  0.01  0.03
C5 -0.62 -0.57 -0.53 -0.08 -0.05 -0.01

 Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.28     0.32  0.37


PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90

A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 9).

Table 9: Congruence coefficients for oblique factor, bifactor and component solutions for the Thurstone problem.
> factor.congruence(list(f3t,f3o,om,p3p))

     MR1  MR2  MR3   PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2  RC3
MR1 1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
MR2 0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
MR3 0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
PA1 1.00 0.03 0.01  1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2 0.06 1.00 0.01  0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99 0.05
PA3 0.13 0.06 1.00  0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01 0.99
g   0.72 0.60 0.52  0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58 0.50
F1  1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
F2  0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
F3  0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
h2  0.74 0.57 0.51  0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56 0.49
RC1 1.00 0.04 0.06  1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06 0.00
RC2 0.08 0.99 0.02  0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00 0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05 1.00

5.4 Determining the number of dimensions to extract

How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).

Techniques most commonly used include

1) Extracting factors until the chi square of the residual matrix is not significant.

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigenvalues of the real data are less than the corresponding eigenvalues of a random data set of the same size (parallel analysis) fa.parallel (Horn, 1965).

4) Plotting the magnitude of the successive eigenvalues and applying the scree test (a sudden drop in eigenvalues analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).

5) Extracting factors as long as they are interpretable.

6) Using the Very Simple Structure Criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigenvalue < 1.

Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples (N > 1000) all of the eigenvalues of random factors will be very close to 1. The scree test is quite appealing but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigator's creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigenvalue of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.
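Several of these criteria may be examined in one pass; vss and fa.parallel are demonstrated in the next two subsections. A brief sketch:

> fa.parallel(bfi[1:25])  # criterion 3: real versus random eigenvalues
> vss(bfi[1:25])          # criterion 6 (VSS) and criterion 7 (Velicer's MAP)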

An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest but that there are also numerous minor factors of no substantive interest but that account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
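This idea can be explored by simulation. A sketch using sim.minor (that the simulated raw scores are returned in the $observed element when n > 0 is an assumption to check against the documentation):

> set.seed(42)
> sm <- sim.minor(nvar = 12, nfact = 3, n = 500)  # 3 major factors plus many trivial minor factors
> fa.parallel(sm$observed)                        # do the criteria recover just the 3 major factors?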

5.4.1 Very Simple Structure

The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Average Partial correlation) of Velicer (1976).

Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 12). However, the MAP criterion suggests that 5 is optimal.


> vss <- vss(bfi[1:25],title="Very Simple Structure of a Big 5 inventory")

[plot: Very Simple Structure fit (0.0 to 1.0) against number of factors (1 to 8) for item complexity 1 through 4; the complexity 1 and 2 curves peak at 4 factors]

Figure 12: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.


> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximimum of 0.58 with 4 factors
VSS complexity 2 achieves a maximimum of 0.74 with 4 factors

The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit RMSEA  BIC   SABIC complex eChisq  SRMR eCRMS  eBIC
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.123 9648 10522.1     1.0  23881 0.119 0.125 21698
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.100 5287  6084.5     1.2  12432 0.086 0.094 10440
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.087 3200  3924.3     1.3   7232 0.066 0.075  5422
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.074 1731  2385.1     1.4   3750 0.047 0.057  2115
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.055  281   869.3     1.6   1495 0.030 0.038    27
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.043 -296   228.5     1.7    670 0.020 0.027  -639
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.037 -463     1.2     1.9    448 0.016 0.023  -711
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.032 -524  -117.6     1.9    289 0.013 0.020  -727

5.4.2 Parallel Analysis

An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 13). It is interesting to compare fa.parallel with paran from the paran package. The latter uses SMCs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial minor (but not zero) factors, show that using SMCs will tend to lead to too many factors.

A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly or fa.parallel with the cor option="poly". By default, the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
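For example (n.iter left at its default of 20, as noted above; expect a noticeable wait for 25 polytomous items):

> fa.parallel(bfi[1:25], cor = "poly", n.iter = 20)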

5.5 Factor extension

Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created


> fa.parallel(bfi[1:25],main="Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors =  6  and the number of components =  6

[scree plot: eigenvalues of principal components and factor analysis (0 to 5) against factor/component number (1 to 25) for the actual, simulated, and resampled data]

Figure 13: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.


to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 14).

Another way to examine the overlap between two sets is the use of set correlation found by set.cor (discussed later).

6 Classical Test Theory and Reliability

Surprisingly, 112 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).

α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

λ3 = (n/(n−1)) (1 − tr(V_x)/V_x) = (n/(n−1)) (V_x − tr(V_x))/V_x = α

Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is, without doubt, the most frequently reported measure of internal consistency reliability. For τ equivalent items, alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy"), coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.

Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, e_j², and is

λ6 = 1 − (Σ e_j²)/V_x = 1 − (Σ (1 − r_smc²))/V_x
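Because the ingredients of λ6 are just the squared multiple correlations and the total test variance, it can be computed by hand and checked against the G6(smc) value printed by alpha. A sketch (smc is the psych function for squared multiple correlations; for a correlation matrix R, the total variance V_x of the unit-weighted composite is sum(R)):

> R <- Thurstone                          # any correlation matrix will do
> round(1 - sum(1 - smc(R)) / sum(R), 2)  # Guttman's λ6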

The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.

G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability


> v16 <- sim.item(16)
> s <- c(1,3,5,7,9,11,13,15)
> f2 <- fa(v16[,s],2)
> fe <- fa.extension(cor(v16)[s,-s],f2)
> fa.diagram(f2,fe=fe)

[extension diagram: MR1 and MR2, defined by the eight odd-numbered variables (loadings about ±0.5 to ±0.6), extended to the eight even-numbered variables (loadings about ±0.5 to ±0.7)]

Figure 14: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.


the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.

Alpha, G6 and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.

More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.

Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.

Functions for examining the reliability of a single scale or a set of scales include

alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item-whole correlations, α if an item is omitted, and item means and standard deviations.

guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zegers (1978) (μ0 ... μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).

omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)

cluster.cor Given an n x c cluster definition matrix of -1s, 0s, and 1s (the keys) and an n x n correlation matrix, find the correlations of the composite clusters.

scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1) and a matrix or data.frame of item scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha coefficient, G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.

score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.

6.1 Reliability of a single scale

A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, and McDonald's ωh and ωt. Consider a simulated data set, representing 9 items with a hierarchical structure and the following correlation matrix. Then using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as the estimates if one item is dropped at a time.

> set.seed(17)
> r9 <- sim.hierarchical(n=500,raw=TRUE)$observed
> round(cor(r9),2)

     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00

> alpha(r9)

Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
0.77 0.8 0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97

Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that item 2 is to be scored (incorrectly) negatively.

> keys <- c(1,-1,1,1,1,1,1)
> alpha(attitude,keys)

Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
       0.43      0.52    0.75      0.14 1.1 0.13   58 5.5

 lower alpha upper     95% confidence boundaries
0.18 0.43 0.68

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r  S/N alpha se
rating           0.32      0.44    0.67     0.114 0.77    0.170
complaints-      0.80      0.80    0.82     0.394 3.89    0.057
privileges       0.27      0.41    0.72     0.103 0.69    0.176
learning         0.14      0.31    0.64     0.069 0.44    0.207
raises           0.14      0.27    0.61     0.059 0.38    0.205
critical         0.36      0.47    0.76     0.130 0.90    0.139
advance          0.21      0.34    0.66     0.079 0.51    0.171

 Item statistics
             n raw.r std.r r.cor r.drop mean   sd
rating      30  0.60  0.60  0.60   0.33   65 12.2
complaints- 30 -0.57 -0.58 -0.74  -0.74   50 13.3
privileges  30  0.66  0.65  0.54   0.42   53 12.2
learning    30  0.80  0.79  0.78   0.64   56 11.7
raises      30  0.82  0.83  0.85   0.69   65 10.4
critical    30  0.50  0.53  0.35   0.27   75  9.9
advance     30  0.74  0.75  0.71   0.58   43 10.3

Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if item 2 is dropped, then the reliability is improved substantially. This suggests that item 2 was incorrectly scored. Doing the analysis again with item 2 positively scored produces much more favorable results.

> keys <- c(1,1,1,1,1,1,1)
> alpha(attitude,keys)

Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
0.76 0.84 0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3

It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.

> items <- sim.congeneric(N=500,short=FALSE,low=-2,high=2,categorical=TRUE) #500 responses to 4 discrete items
> alpha(items$observed)  #item response analysis of congeneric measures

Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
0.68 0.72 0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02

Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0

6.2 Using omega to find the reliability of a single scale

Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 15) or omegaSem for a confirmatory analysis using the sem package (Fox et al., 2013) based upon the exploratory solution from omega.

McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) (http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf) compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009), http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf.)

One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.
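These steps are wrapped by omega, but they can be carried out by hand with the schmid function named above. A minimal sketch (that the Schmid-Leiman loadings sit in the $sl element is an assumption to verify against the documentation):

> sl <- schmid(Thurstone, 3)  # oblique factoring followed by a Schmid-Leiman transformation
> round(sl$sl, 2)             # g loadings, residualized group factor loadings, h2, u2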

ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal). For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively, and some negatively. Although it is probably better to specify which directions the items are to be scored by specifying a key vector, if flip = TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)

β, an alternative to ω, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).

The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).

Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.

There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be the √r_ab, where r_ab is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. 2007. To do this in omega, add the option="first" or option="second" to the call.

Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.

In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.

McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness u² from factor analysis to find e_j². This is based on a decomposition of the variance of a test score, V_x, into four parts: that due to a general factor, g, that due to a set of group factors, f (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)

Letting x = cg + Af + Ds + e, then the communality of item j, based upon general as well as group factors, is h_j² = c_j² + Σ f_ij², and the unique variance for the item, u_j² = σ_j²(1 − h_j²), may be used to estimate the test reliability. That is, if h_j² is the communality of item j, based upon general as well as group factors, then for standardized items, e_j² = 1 − h_j² and

ωt = (1cc′1 + 1AA′1′)/V_x = 1 − Σ(1 − h_j²)/V_x = 1 − (Σ u_j²)/V_x

Because h_j² ≥ r_smc², ωt ≥ λ6.

It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor:

ωh = 1cc′1/V_x

Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

ω∞ = 1cc′1/(1cc′1 + 1AA′1′)

In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.

> om9

9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36

general/max 4.76   max/min = 1.46
mean percent general = 0.64   with sd = 0.08 and cv of 0.13
Explained Common Variance of the general factor = 0.65

The degrees of freedom are 12 and the fit is 0.03
The number of observations was 500 with Chi Square = 15.86 with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03


> om9 <- omega(r9,title="9 simulated variables")

[omega diagram: V1-V9 load on group factors F1, F2 and F3 (0.2 to 0.5) and on the general factor g (0.3 to 0.7)]

Figure 15: A bifactor solution for 9 simulated variables with a hierarchical structure.


RMSEA index = 0.026 and the 90% confidence intervals are NA 0.055
BIC = -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27 and the fit is 0.32
The number of observations was 500 with Chi Square = 159.82 with prob < 8.3e-21
The root mean square of the residuals is 0.08
The df corrected root mean square of the residuals is 0.09

RMSEA index = 0.1 and the 90% confidence intervals are 0.085 0.114
BIC = -7.97

Measures of factor score adequacy
                                              g    F1    F2    F3
Correlation of scores with factors         0.84  0.59  0.61  0.54
Multiple R square of scores with factors   0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                              g   F1   F2   F3
Omega total for total scores and subscales 0.83 0.79 0.56 0.65
Omega general for total scores and subscales 0.69 0.55 0.30 0.46
Omega group for total scores and subscales 0.12 0.24 0.26 0.19

6.3 Estimating ωh using Confirmatory Factor Analysis

The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.

> omegaSem(r9,n.obs=500)

Call: omegaSem(m = r9, n.obs = 500)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36

general/max 4.76   max/min = 1.46
mean percent general = 0.64   with sd = 0.08 and cv of 0.13
Explained Common Variance of the general factor = 0.65

The degrees of freedom are 12 and the fit is 0.03
The number of observations was 500 with Chi Square = 15.86 with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03
RMSEA index = 0.026 and the 90% confidence intervals are NA 0.055
BIC = -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27 and the fit is 0.32
The number of observations was 500 with Chi Square = 159.82 with prob < 8.3e-21
The root mean square of the residuals is 0.08
The df corrected root mean square of the residuals is 0.09

RMSEA index = 0.1 and the 90% confidence intervals are 0.085 0.114
BIC = -7.97

Measures of factor score adequacy
                                              g    F1    F2    F3
Correlation of scores with factors         0.84  0.59  0.61  0.54
Multiple R square of scores with factors   0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                              g   F1   F2   F3
Omega total for total scores and subscales 0.83 0.79 0.56 0.65
Omega general for total scores and subscales 0.69 0.55 0.30 0.46
Omega group for total scores and subscales 0.12 0.24 0.26 0.19

Omega Hierarchical from a confirmatory model using sem = 0.72
Omega Total from a confirmatory model using sem = 0.83
With loadings of
      g   F1   F2   F3   h2   u2
V1 0.74 0.40           0.71 0.29
V2 0.58 0.36           0.47 0.53
V3 0.56 0.42           0.50 0.50
V4 0.57      0.45      0.53 0.47
V5 0.58      0.25      0.40 0.60
V6 0.43      0.26      0.26 0.74
V7 0.49           0.38 0.38 0.62
V8 0.44           0.45 0.39 0.61
V9 0.34           0.24 0.17 0.83

With eigenvalues of:
   g   F1   F2   F3
2.60 0.47 0.40 0.33

6.3.1 Other estimates of reliability

Other estimates of reliability are found by the splitHalf function. These are described in more detail in Revelle and Zinbarg (2009). They include the 6 estimates from Guttman, four from ten Berge, and an estimate of the greatest lower bound.

> splitHalf(r9)

Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) = 0.84
Guttman lambda 6                          = 0.8
Average split half reliability            = 0.79
Guttman lambda 3 (alpha)                  = 0.8
Minimum split half reliability (beta)     = 0.72

6.4 Reliability and correlations of multiple scales within an inventory

A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness and Openness). The data may either be the raw data or a correlation matrix (scoreItems), or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6*.

6.4.1 Scoring from raw data

To score these five scales from the 25 items, use the scoreItems function with the helper function make.keys. Logically, scales are merely weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), and -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items or can be specified by the user, as in the arithmetic sketch below.
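To make the reverse scoring rule concrete, here is a minimal arithmetic sketch (the 1 to 6 response scale is an assumption chosen for illustration):

item <- c(2, 5, 6, 1)       # raw responses on a 1 to 6 scale
reversed <- (6 + 1) - item  # max + min = 7, so the reversed scores are 5 2 1 6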

There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.
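A hedged sketch of both reporting conventions (totals is the scoreItems option assumed here to switch between them):

scores.mean  <- scoreItems(keys, bfi)                # default: average item score
scores.total <- scoreItems(keys, bfi, totals = TRUE) # total item score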

The printed output includes coefficients α and G6*, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.

First, create the keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copying it into R.) Although not normally necessary, show the keys to understand what is happening.

Note that the number of items to specify in the make.keys function is the total number of items in the inventory. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.

> keys <- make.keys(nvars=28,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+   Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+   Openness = c(21,-22,23,24,-25)),
+   item.labels=colnames(bfi))
> keys

          Agree Conscientious Extraversion Neuroticism Openness
A1           -1             0            0           0        0
A2            1             0            0           0        0
A3            1             0            0           0        0
A4            1             0            0           0        0
A5            1             0            0           0        0
C1            0             1            0           0        0
C2            0             1            0           0        0
C3            0             1            0           0        0
C4            0            -1            0           0        0
C5            0            -1            0           0        0
E1            0             0           -1           0        0
E2            0             0           -1           0        0
E3            0             0            1           0        0
E4            0             0            1           0        0
E5            0             0            1           0        0
N1            0             0            0           1        0
N2            0             0            0           1        0
N3            0             0            0           1        0
N4            0             0            0           1        0
N5            0             0            0           1        0
O1            0             0            0           0        1
O2            0             0            0           0       -1
O3            0             0            0           0        1
O4            0             0            0           0        1
O5            0             0            0           0       -1
gender        0             0            0           0        0
education     0             0            0           0        0
age           0             0            0           0        0

The use of multiple key matrices for different inventories is facilitated by using the superMatrix function to combine two or more matrices. This allows convenient scoring of large data sets combining multiple inventories with keys based upon each individual inventory. Pretend for the moment that the big 5 items were made up of two inventories, one consisting of the first 10 items, the second the last 18 items (15 personality items + 3 demographic items). Then the following code would work:

> keys.1 <- make.keys(10,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10)))
> keys.2 <- make.keys(15,list(Extraversion=c(-1,-2,3:5),Neuroticism=c(6:10),
+   Openness = c(11,-12,13,14,-15)))
> keys.25 <- superMatrix(list(keys.1,keys.2))

The resulting keys matrix is identical to that found above, except that it does not include the extra 3 demographic items. This is useful when scoring the raw items because the response frequencies for each category are reported, including those for the demographic items.

This use of making multiple key matrices and then combining them into one super matrix of keys is particularly useful when combining demographic information with items to be scored. A set of demographic keys can be made and then these can be combined with the keys for the particular scales, as sketched below.
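A hypothetical sketch of that combination (the demographic key names are made up for illustration):

demo.keys <- make.keys(3, list(gender = 1, education = 2, age = 3),
                       item.labels = c("gender", "education", "age"))
all.keys  <- superMatrix(list(keys.25, demo.keys))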

Now use these keys in combination with the raw data to score the items, calculate basic reliability and intercorrelations, and find the item-by-scale correlations for each item and each scale. By default, missing data are replaced by the median for that variable; a hedged sketch of changing that behavior follows.
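If imputation is not wanted, the impute option may be changed (impute = "none" is assumed here to leave missing responses out of the composite):

scores.noimpute <- scoreItems(keys, bfi, impute = "none")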

> scores <- scoreItems(keys,bfi)
> scores

Call: scoreItems(keys = keys, items = bfi)

(Unstandardized) Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
alpha   0.7          0.72         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
    Agree Conscientious Extraversion Neuroticism Openness
ASE 0.014         0.014        0.013       0.011    0.017

Average item correlation:
          Agree Conscientious Extraversion Neuroticism Openness
average.r  0.32          0.34         0.39        0.46     0.23

 Guttman 6* reliability: 
         Agree Conscientious Extraversion Neuroticism Openness
Lambda.6   0.7          0.72         0.76        0.81      0.6

Signal/Noise based upon av.r : 
             Agree Conscientious Extraversion Neuroticism Openness
Signal/Noise   2.3           2.6          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation 
 raw correlations below the diagonal, alpha on the diagonal 
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.70          0.36         0.63      -0.245     0.23
Conscientious  0.26          0.72         0.35      -0.305     0.30
Extraversion   0.46          0.26         0.76      -0.284     0.32
Neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
Openness       0.15          0.19         0.22      -0.086     0.60

 In order to see the item by scale loadings and frequency counts of the data
 print with the short option = FALSE

To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores element of the scores object (see Figure 16).

6.4.2 Forming scales from a correlation matrix

There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.

Consider the same bfi data set, but first find the correlations and then use cluster.cor:

> r.bfi <- cor(bfi,use="pairwise")
> scales <- cluster.cor(keys,r.bfi)
> summary(scales)

Call: cluster.cor(keys = keys, r.mat = r.bfi)

Scale intercorrelations corrected for attenuation 
 raw correlations below the diagonal, (standardized) alpha on the diagonal 
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.71          0.35         0.64      -0.242     0.25
Conscientious  0.25          0.73         0.36      -0.287     0.30
Extraversion   0.47          0.27         0.76      -0.278     0.35
Neuroticism   -0.18         -0.22        -0.22       0.815    -0.11
Openness       0.16          0.20         0.24      -0.074     0.61

To find the correlations of the items with each of the scales (the "structure" matrix) or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function, sketched below. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.
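A minimal sketch of the cluster.loadings call (the element names in the comments are assumptions based on the function's documented output):

cl <- cluster.loadings(keys, r.bfi)
cl$loadings   # assumed: the structure matrix of item by scale correlations
cl$pattern    # assumed: the pattern matrix, controlling for the other scales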


> png('scores.png')
> pairs.panels(scores$scores,pch='.',jiggle=TRUE)
> dev.off()
quartz 
     2 

Figure 16: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.


6.5 Scoring Multiple Choice Items

Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice items where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).

> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys,iqitems)

Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

item statistics 
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39

> #just convert the items to true or false 
> iq.tf <- score.multiple.choice(iq.keys,iqitems,score=FALSE)
> describe(iq.tf)  #compare to previous results

          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01

Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function, as sketched below.
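A hedged sketch, grouping the 16 scored items into their four item types (the grouping into subscales is an assumption based on the item names):

iq.scale.keys <- make.keys(16, list(reason = 1:4, letter = 5:8,
                                    matrix = 9:12, rotate = 13:16),
                           item.labels = colnames(iq.tf))
scoreItems(iq.scale.keys, iq.tf)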

6.6 Item analysis

Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss) or parallel analysis (fa.parallel). Item-whole correlations may then be found for scales scored on one dimension (alpha) or for many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or, based upon correlation matrices, using the setCor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches. A minimal workflow sketch follows.
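This sketch strings the steps together on the bfi items (the particular item subsets and the five factor solution are illustrative assumptions):

describe(bfi[1:25])                # item level descriptive statistics
fa.parallel(bfi[1:25])             # suggest the number of dimensions
f5 <- fa(bfi[1:25], nfactors = 5)  # a five factor solution
alpha(bfi[1:5], check.keys = TRUE) # reliability of the first five (Agreeableness) items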

Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 17).

7 Item Response Theory analysis

Item Response Theory is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty (or location) and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 18).

> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys,iqitems,score=TRUE,short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  #set this to see the output for multiple items
> irt.responses(scores$scores,iqitems[1:4],breaks=11)

[Figure 17 here: four panels (reason.4, reason.16, reason.17, reason.19) showing P(response) for each response alternative (0-6) as a function of theta.]

Figure 17: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.

7.1 Factor analysis and Item Response Theory

If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, $\delta_j$, and item discrimination, $\alpha_j$, may be found from factor analysis of the tetrachoric correlations, where $\lambda_j$ is just the factor loading on the first factor and $\tau_j$ is the normal threshold reported by the tetrachoric function:

$$\delta_j = \frac{D \tau_j}{\sqrt{1-\lambda_j^2}}, \qquad \alpha_j = \frac{\lambda_j}{\sqrt{1-\lambda_j^2}} \tag{2}$$

where D is a scaling factor used when converting to the parameterization of the logistic model, and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings ($\lambda_j$) and item thresholds ($\tau_j$) are just

$$\lambda_j = \frac{\alpha_j}{\sqrt{1+\alpha_j^2}}, \qquad \tau_j = \frac{\delta_j}{\sqrt{1+\alpha_j^2}}.$$
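A numeric sketch of Equation 2 for the normal ogive case (D = 1; the loading and threshold values are made up for illustration):

lambda <- 0.6                         # factor loading from the tetrachoric solution
tau    <- 0.5                         # threshold from the tetrachoric function
alpha  <- lambda / sqrt(1 - lambda^2) # discrimination: 0.75
delta  <- tau / sqrt(1 - lambda^2)    # difficulty: 0.625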

Consider 9 dichotomous items representing one factor but differing in their levels of difficulty:

> set.seed(17)
> d9 <- sim.irt(9,1000,-2.5,2.5,mod="normal") #dichotomous items
> test <- irt.fa(d9$items)

> test

Item Response Analysis using Factor Analysis 

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis 

Summary information by factor and item
 Factor =  1 
               -3    -2    -1     0    1    2     3
V1           0.48  0.59  0.24  0.06 0.01 0.00  0.00
V2           0.30  0.68  0.45  0.12 0.02 0.00  0.00
V3           0.12  0.50  0.74  0.29 0.06 0.01  0.00
V4           0.05  0.26  0.74  0.55 0.14 0.03  0.00
V5           0.01  0.07  0.44  1.05 0.41 0.06  0.01
V6           0.00  0.03  0.15  0.60 0.74 0.24  0.04
V7           0.00  0.01  0.04  0.22 0.73 0.63  0.16
V8           0.00  0.00  0.02  0.12 0.45 0.69  0.31
V9           0.00  0.01  0.02  0.08 0.25 0.47  0.36
Test Info    0.98  2.14  2.85  3.09 2.81 2.14  0.89
SEM          1.01  0.68  0.59  0.57 0.60 0.68  1.06
Reliability -0.02  0.53  0.65  0.68 0.64 0.53 -0.12

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27  and the objective function was  1.2 
The number of observations was  1000  with Chi Square =  1195.58  with prob <  7.3e-235 

The root mean square of the residuals (RMSA) is  0.09 
The df corrected root mean square of the residuals is  0.1 

Tucker Lewis Index of factoring reliability =  0.699
RMSEA index =  0.209  and the 90 % confidence intervals are  0.198 0.218
BIC =  1009.07

Similar analyses can be done for polytomous items, such as those of the bfi extraversion scale:

> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt

Item Response Analysis using Factor Analysis 

Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis 

Summary information by factor and item
 Factor =  1 
              -3   -2   -1    0    1    2    3
E1          0.37 0.58 0.75 0.76 0.61 0.39 0.21
E2          0.48 0.89 1.18 1.24 0.99 0.59 0.25
E3          0.34 0.49 0.58 0.57 0.49 0.36 0.23
E4          0.63 0.98 1.11 0.96 0.67 0.37 0.16
E5          0.33 0.42 0.47 0.45 0.37 0.27 0.17
Test Info   2.15 3.36 4.09 3.98 3.13 1.97 1.02
SEM         0.68 0.55 0.49 0.50 0.56 0.71 0.99
Reliability 0.53 0.70 0.76 0.75 0.68 0.49 0.02

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5  and the objective function was  0.05 
The number of observations was  2800  with Chi Square =  135.92  with prob <  1.3e-27 

The root mean square of the residuals (RMSA) is  0.04 
The df corrected root mean square of the residuals is  0.06 

Tucker Lewis Index of factoring reliability =  0.932
RMSEA index =  0.097  and the 90 % confidence intervals are  0.083 0.111
BIC =  96.23

The item information functions show that not all of the items are equally good (Figure 19). These procedures can be generalized to more than one factor by specifying the number of factors in irt.fa.

> op <- par(mfrow=c(3,1))
> plot(test,type="ICC")
> plot(test,type="IIC")
> plot(test,type="test")
> op <- par(mfrow=c(1,1))

[Figure 18 here: three panels against the latent trait (normal scale): item characteristic curves for V1-V9, item information curves, and the total test information curve with a reliability scale.]

Figure 18: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.

> e.info <- plot(e.irt,type="IIC")

[Figure 19 here: item information curves for E1-E5 (E1 and E2 negatively keyed) as a function of the latent trait (normal scale).]

Figure 19: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.

The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.

> print(e.info,sort=TRUE)

Item Response Analysis using Factor Analysis 

Summary information by factor and item
 Factor =  1 
              -3   -2   -1    0    1    2    3
E2          0.48 0.89 1.18 1.24 0.99 0.59 0.25
E4          0.63 0.98 1.11 0.96 0.67 0.37 0.16
E1          0.37 0.58 0.75 0.76 0.61 0.39 0.21
E3          0.34 0.49 0.58 0.57 0.49 0.36 0.23
E5          0.33 0.42 0.47 0.45 0.37 0.27 0.17
Test Info   2.15 3.36 4.09 3.98 3.13 1.97 1.02
SEM         0.68 0.55 0.49 0.50 0.56 0.71 0.99
Reliability 0.53 0.70 0.76 0.75 0.68 0.49 0.02

More extensive IRT packages include ltm and eRm; these should be used for serious Item Response Theory analysis.

7.2 Speeding up analyses

Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix; that is, by doing the analysis not on the original data, but rather on the output of the previous analysis. A hedged sketch of the pattern follows.
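This sketch uses the ability items scored above; the same calls appear in the worked example below:

iq.irt <- irt.fa(iq.tf)     # slow: finds the tetrachoric correlations
om <- omega(iq.irt$rho, 4)  # fast: reuses the saved correlation matrix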

For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 20) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items.

> iq.irt

Item Response Analysis using Factor Analysis 

Call: irt.fa(x = iq.tf)
Item Response Analysis using Factor Analysis 

Summary information by factor and item
 Factor =  1 
             -3   -2   -1    0    1    2    3
reason.4   0.04 0.20 0.59 0.59 0.19 0.04 0.01
reason.16  0.07 0.23 0.46 0.39 0.16 0.05 0.01
reason.17  0.06 0.27 0.69 0.50 0.14 0.03 0.01
reason.19  0.05 0.17 0.41 0.46 0.22 0.07 0.02

> iq.irt <- irt.fa(iq.tf)

[Figure 20 here: item information curves for the 16 ability items (reason, letter, matrix, and rotate) as a function of the latent trait (normal scale).]

Figure 20: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.

letter.7    0.04 0.16 0.44 0.52 0.23 0.06  0.02
letter.33   0.04 0.14 0.34 0.43 0.24 0.08  0.02
letter.34   0.04 0.17 0.48 0.54 0.22 0.06  0.01
letter.58   0.02 0.08 0.28 0.55 0.39 0.13  0.03
matrix.45   0.05 0.11 0.23 0.29 0.21 0.10  0.04
matrix.46   0.05 0.12 0.25 0.30 0.21 0.10  0.04
matrix.47   0.05 0.16 0.37 0.41 0.21 0.07  0.02
matrix.55   0.04 0.08 0.14 0.20 0.19 0.13  0.06
rotate.3    0.00 0.01 0.07 0.29 0.66 0.45  0.13
rotate.4    0.00 0.01 0.05 0.30 0.91 0.55  0.11
rotate.6    0.01 0.03 0.14 0.49 0.68 0.27  0.06
rotate.8    0.00 0.02 0.08 0.27 0.54 0.39  0.13
Test Info   0.55 1.95 4.99 6.53 5.41 2.57  0.71
SEM         1.34 0.72 0.45 0.39 0.43 0.62  1.18
Reliability -0.80 0.49 0.80 0.85 0.82 0.61 -0.40

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104  and the objective function was  1.85 
The number of observations was  1525  with Chi Square =  2801.05  with prob <  0 

The root mean square of the residuals (RMSA) is  0.08 
The df corrected root mean square of the residuals is  0.09 

Tucker Lewis Index of factoring reliability =  0.742
RMSEA index =  0.131  and the 90 % confidence intervals are  0.126 0.135
BIC =  2038.76

7.3 IRT based scoring

The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for score.irt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items per person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The score.irt function applies basic IRT to this problem.

Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects:

> v9 <- sim.irt(9,1000,-2.,2.,mod="normal") #dichotomous items
> items <- v9$items
> test <- irt.fa(items)

> om <- omega(iq.irt$rho,4)

[Figure 21 here: omega diagram of the 16 ability items, showing a general factor g (loadings 0.4-0.7) and four group factors F1-F4 grouping the rotate, letter, reason, and matrix items.]

Figure 21: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when examining the item response functions.

> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> #now delete some of the data - note that they are ordered by score
> items[1:333,5:9] <- NA
> items[334:666,3:7] <- NA
> items[667:1000,1:4] <- NA
> scores <- score.irt(test,items)
> unitweighted <- score.irt(items=items,keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord],scores,unitweighted)
> colnames(scores.df) <- c("True theta","irt theta","total","fit","rasch","total","fit")

These results are seen in Figure 22

8 Multilevel modeling

Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (overall descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.

8.1 Decomposing data into within and between level correlations using statsBy

There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.

This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:

$$r_{xy} = \eta_{x_{wg}} \cdot \eta_{y_{wg}} \cdot r_{xy_{wg}} + \eta_{x_{bg}} \cdot \eta_{y_{bg}} \cdot r_{xy_{bg}} \tag{3}$$

Figure 22: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.
> pairs.panels(scores.df,pch='.',gap=0)
> title('Comparing true theta for IRT, Rasch and classically based scoring',line=3)

[Figure 22 here: pairs.panels scatter plot matrix of True theta, irt theta, total, fit, rasch, total, and fit, titled "Comparing true theta for IRT, Rasch and classically based scoring".]

where $r_{xy}$ is the normal correlation, which may be decomposed into a within group correlation $r_{xy_{wg}}$ and a between group correlation $r_{xy_{bg}}$, and η (eta) is the correlation of the data with the within group values or the group means.

8.2 Generating and displaying multilevel data

withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, and -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5 and V8, and V3, V6 and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6 and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5 and V6, and V7, V8 and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.

sim.multilevel will generate simulated data with a multilevel structure.

The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.

Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)) or when affect data are analyzed within and between an affect manipulation (statsBy(affect)). A minimal sketch follows.
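This sketch decomposes the sat.act correlations by education level (rwg and rbg are the within and between group correlation elements of the statsBy output):

sb <- statsBy(sat.act, group = "education")
sb$rwg   # pooled within group correlations
sb$rbg   # between group correlations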

9 Set Correlation and Multiple Regression from the correlation matrix

An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is

$$R^2 = 1 - \prod_{i=1}^{n} (1-\lambda_i)$$

where $\lambda_i$ is the $i$th eigenvalue of the eigenvalue decomposition of the matrix

$$R = R_{xx}^{-1} R_{xy} R_{yy}^{-1} R_{yx}.$$

Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic based upon the average canonical correlation might be more appropriate.

setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.

Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example, the analysis is done on the correlation matrix rather than the raw data.

> C <- cov(sat.act,use="pairwise")
> model1 <- lm(ACT~ gender + education + age, data=sat.act)
> summary(model1)

Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max 
-25.2458  -3.2133   0.7769   3.5921   9.2630 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110    
education    0.47890    0.15235   3.143  0.00174 ** 
age          0.01623    0.02278   0.712  0.47650    
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared: 0.0272, Adjusted R-squared: 0.02301 
F-statistic: 6.487 on 3 and 696 DF, p-value: 0.0002476

Compare this with the output from setCor:

> #compare with mat.regress
> setCor(c(4:6),c(1:3),C, n.obs=700)

Call: setCor(y = y, x = x, data = data, z = z, n.obs = n.obs, use = use, 
    std = std, square = square, main = main, plot = plot)

Multiple Regression from matrix input 

Beta weights 
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R 
 ACT SATV SATQ 
0.16 0.10 0.19 

multiple R2 
   ACT   SATV   SATQ 
0.0272 0.0096 0.0359 

Unweighted multiple R 
 ACT SATV SATQ 
0.15 0.05 0.11 

Unweighted multiple R2 
 ACT SATV SATQ 
0.02 0.00 0.01 

SE of Beta weights 
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights 
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t < 
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2 
   ACT   SATV   SATQ 
0.0230 0.0054 0.0317 

Standard Error of R2 
   ACT   SATV   SATQ 
0.0120 0.0073 0.0137 

F 
 ACT SATV SATQ 
6.49 2.26 8.63 

Probability of F < 
     ACT     SATV     SATQ 
2.48e-04 8.08e-02 1.24e-05 

degrees of freedom of regression 
[1]   3 696

Various estimates of between set correlations
Squared Canonical Correlations 
[1] 0.050 0.033 0.008

Chisq of canonical correlations 
[1] 35.8 23.1  5.6

 Average squared canonical correlation =  0.03 
 Cohen's Set Correlation R2 =  0.09 
Shrunken Set Correlation R2 =  0.08
F and df of Cohen's Set Correlation  7.26 9 1681.86
Unweighted correlation between the two sets =  0.01

Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same independent of the direction of the relationship.

10 Simulation functions

It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.

A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex and sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item (a more general item simulation), sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated Rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural (a general simulation of structural models), sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.

sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.simple Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.

sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.

Some of these functions are described in more detail in the companion vignette: psych for sem.

The default values for sim.structure is to generate a 4 factor, 12 variable data set with a simplex structure between the factors.

Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.

Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals. A hedged sketch follows.
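This sketch simulates 12 variables with 3 major factors plus minor factors, assuming the simulated raw scores are returned in the observed element of the result:

set.seed(42)
minor <- sim.minor(nvar = 12, nfact = 3, n = 500)
fa.parallel(minor$observed)   # can the three major factors still be recovered?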

Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.

The four IRT simulations, sim.rasch, sim.irt, sim.npl, and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.

The logistic model is

$$P(x | \theta_i, \delta_j, \gamma_j, \zeta_j) = \gamma_j + \frac{\zeta_j - \gamma_j}{1 + e^{\alpha_j (\delta_j - \theta_i)}} \tag{4}$$

where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), $\alpha_j$ is item discrimination, and $\delta_j$ is item difficulty. For the 1 Parameter Logistic (Rasch) model, γ = 0, ζ = 1, α = 1, and item difficulty is the only free parameter to specify.
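Equation 4 can be transcribed directly; p.irt here is a hypothetical helper written for illustration, not a psych function:

p.irt <- function(theta, delta, alpha = 1, gamma = 0, zeta = 1) {
  gamma + (zeta - gamma) / (1 + exp(alpha * (delta - theta)))
}
p.irt(theta = 0, delta = 0)   # 0.5: a person located exactly at the item difficulty
p.irt(theta = 0, delta = 1)   # about 0.27: a harder item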

(Graphics of these may be seen in the demonstrations for the logistic function.)

The normal model (sim.npn) calculates the probability using pnorm instead of the logistic function used in sim.npl, but the meaning of the parameters is otherwise the same. With the α parameter = 1.702 in the logistic model, the two models are practically identical.

11 Graphical Displays

Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as by specific diagram functions, e.g. (but not shown):

f3 <- fa(Thurstone,3)
plot(f3)
fa.diagram(f3)
c <- iclust(Thurstone)
plot(c)            # a pretty boring plot
iclust.diagram(c)  # a better diagram
c3 <- iclust(Thurstone,3)
plot(c3)           # a more interesting plot
data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)
ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)

The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in the dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram, and iclust.diagram functions.

> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA,xlim=xlim,ylim=ylim,main="Demonstration of dia functions",axes=FALSE,xlab="",ylab="")
> ul <- dia.rect(1,9,labels="upper left",xlim=xlim,ylim=ylim)
> ll <- dia.rect(1,3,labels="lower left",xlim=xlim,ylim=ylim)
> lr <- dia.ellipse(9,3,"lower right",xlim=xlim,ylim=ylim)
> ur <- dia.ellipse(9,9,"upper right",xlim=xlim,ylim=ylim)
> ml <- dia.ellipse(3,6,"middle left",xlim=xlim,ylim=ylim)
> mr <- dia.ellipse(7,6,"middle right",xlim=xlim,ylim=ylim)
> bl <- dia.ellipse(1,1,"bottom left",xlim=xlim,ylim=ylim)
> br <- dia.rect(9,1,"bottom right",xlim=xlim,ylim=ylim)
> dia.arrow(from=lr,to=ul,labels="right to left")
> dia.arrow(from=ul,to=ur,labels="left to right")
> dia.curved.arrow(from=lr,to=ll$right,labels ="right to left")
> dia.curved.arrow(to=ur,from=ul$right,labels ="left to right")
> dia.curve(ll$top,ul$bottom,"double")  #for rectangles, specify where to point
> dia.curved.arrow(mr,ur,"up")          #but for ellipses, just point to it
> dia.curve(ml,mr,"across")
> dia.arrow(ur,lr,"top down")
> dia.curved.arrow(br$top,lr$bottom,"up")
> dia.curved.arrow(bl,br,"left to right")
> dia.arrow(bl,ll$bottom)
> dia.curved.arrow(ml,ll$right)
> dia.curved.arrow(mr,lr$top)

[Figure 23 here: labeled rectangles and ellipses connected by dia.arrow, dia.curve, and dia.curved.arrow calls, demonstrating the dia functions.]

Figure 23: The basic graphic capabilities of the dia functions are shown in this figure.

12 Miscellaneous functions

A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.

block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor, and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score.

geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headtail. Combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r, and estimates effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also setCor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.

superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys or when forming example problems.

13 Data sets

A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as McDonald (1999). The original data are from Thurstone and Thurstone (1941) and were reanalyzed by Bechtoldt (1961). Personality item data represent five personality factors on 25 items (bfi) or 13 personality inventory scores (epi.bfi), and there are 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.

Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative, and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.

iq 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton height. peas is the data set Francis Galton used to introduce the correlation coefficient, with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1994), this data set allows for examples of basic scaling techniques.

14 Development version and a users guide

The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered, ones). To download that version, go to the repository http://personality-project.org/r/src/contrib/ and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g. ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.

News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news:

> news(Version > "1.2.8",package="psych")

15 Psychometric Theory

The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book): An introduction to Psychometric Theory with Applications in R (Revelle, in prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular, and R in general, consult http://personality-project.org/r/r.guide.html, A short guide to R.

16 SessionInfo

This document was prepared using the following settings:

> sessionInfo()

R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] psych_1.6.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4           lattice_0.20-33       matrixcalc_1.0-3      MASS_7.3-45          
 [5] grid_3.3.0            arm_1.8-6             nlme_3.1-127          stats4_3.3.0         
 [9] coda_0.18-1           minqa_1.2.4           mi_1.0                nloptr_1.0.4         
[13] Matrix_1.2-6          boot_1.3-18           splines_3.3.0         lme4_1.1-12          
[17] sem_3.1-7             tools_3.3.0           abind_1.4-3           GPArotation_2014.11-1
[21] parallel_3.3.0        mnormt_1.5-4         

References

Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405–432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439–458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447–473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel Modeling in R (2.3): A Brief Introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245–276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78–98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297–334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173–178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England.

Fox, J., Nie, Z., and Byrnes, J. (2013). sem: Structural Equation Models. R package version 3.1-3.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430–450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255–282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121–132.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41–54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283–300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4):537–549.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231–258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309–317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153–175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676–1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481–495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill, New York, 3rd edition.

Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57–74.

Revelle, W. (2017). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. http://cran.r-project.org/web/packages/psych. R package version 1.7.3.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39–73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403–414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Individual differences in cognition: New methods for examining the personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27–49. Springer, New York, NY.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega, and the glb: comments on Sijtsma. Psychometrika, 74(1):145–154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83–90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345–353.

Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components–an alternative to "mathematical factors". Psychological Review, 42(5):425–454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321–327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123–133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121–144.


packages may be installed by first installing and then using the task views (ctv) package to install the "Psychometrics" task view, but doing it this way is not necessary.

4 Basic data analysis

A number of psych functions facilitate the entry of data and finding basic descriptive statistics.

Remember, to run any of the psych functions, it is necessary to make the package active by using the library command:

R code
library(psych)

The other packages, once installed, will be called automatically by psych.

It is possible to automatically load psych and other functions by creating and then saving a ".First" function, e.g.,

R code
.First <- function(x) {library(psych)}

4.1 Getting the data by using read.file

Although many find copying the data to the clipboard and then using the read.clipboard functions (see below) helpful, an alternative is to read the data in directly. This can be done using the read.file function, which calls file.choose to find the file and then, based upon the suffix of the file, chooses the appropriate way to read it. For files with suffixes of .txt, .text, .r, .rds, .rda, .csv, .xpt, or .sav, the file will be read correctly.

R code
my.data <- read.file()

If the file contains Fixed Width Format (fwf) data, the column information can be specified with the widths command.

R code
my.data <- read.file(widths = c(4, rep(1, 35)))  # will read in a file without a header row
#  and 36 fields, the first of which is 4 columns wide, the rest of which are 1 column each

If the file is a RData file (with suffix of .RData, .Rda, .rda, .Rdata, or .rdata), the object will be loaded. Depending on what was stored, this might be several objects. If the file is a .sav file from SPSS, it will be read with the most useful default options (converting the file to a data.frame and converting character fields to numeric). Alternative options may be specified. If it is an export file from SAS (.xpt or .XPT), it will be read. .csv files (comma separated files), normal .txt or .text files, and .data or .dat files will be read as well. These are assumed to have a header row of variable labels (header=TRUE). If the data do not have a header row, you must specify read.file(header=FALSE).

To read SPSS files and to keep the value labels, specify use.value.labels=TRUE. Unfortunately, this will make it difficult to do any analyses, and you might want to read it again, converting the value labels to numeric values.

R code
my.spss <- read.file(use.value.labels=TRUE)  # this will keep the value labels for .sav files
my.data <- read.file()  # just read the file and convert the value labels to numeric

You can quickView the my.spss object to see what it looks like, as well as the my.data object to see the numeric encoding.

R code
quickView(my.spss)  # to see the original encoding
quickView(my.data)  # to see the numeric encoding

4.2 Data input from the clipboard

There are of course many ways to enter data into R. Reading from a local file using read.file is perhaps the most preferred. However, many users will enter their data in a text editor or spreadsheet program and then want to copy and paste into R. This may be done by using read.table and specifying the input file as "clipboard" (PCs) or "pipe(pbpaste)" (Macs). Alternatively, the read.clipboard set of functions are perhaps more user friendly:

read.clipboard is the base function for reading data from the clipboard.

read.clipboard.csv for reading text that is comma delimited.

read.clipboard.tab for reading text that is tab delimited (e.g., copied directly from an Excel file).

read.clipboard.lower for reading input of a lower triangular matrix with or without a diagonal. The resulting object is a square matrix.

read.clipboard.upper for reading input of an upper triangular matrix.

read.clipboard.fwf for reading in fixed width fields (some very old data sets).

For example, given a data set copied to the clipboard from a spreadsheet, just enter the command

R code
my.data <- read.clipboard()

This will work if every data field has a value and even missing data are given some values (e.g., NA or -999). If the data were entered in a spreadsheet and the missing values were just empty cells, then the data should be read in as tab delimited or by using the read.clipboard.tab function.

R code
my.data <- read.clipboard(sep="\t")   # define the tab option, or
my.tab.data <- read.clipboard.tab()   # just use the alternative function

For the case of data in fixed width fields (some old data sets tend to have this format), copy to the clipboard and then specify the width of each field (in the example below, the first variable is 5 columns, the second is 2 columns, the next 5 are 1 column each, and the last 4 are 3 columns).

R code
my.data <- read.clipboard.fwf(widths = c(5, 2, rep(1, 5), rep(3, 4)))

4.3 Basic descriptive statistics

Once the data are read in, then describe will provide basic descriptive statistics arranged in a data frame format. Consider the data set sat.act, which includes data from 700 web based participants on 3 demographic variables and 3 ability measures.

describe reports means, standard deviations, medians, min, max, range, skew, kurtosis, and standard errors for integer or real data. Non-numeric data, although the statistics are meaningless, will be treated as if numeric (based upon the categorical coding of the data) and will be flagged with an *.

It is very important to describe your data before you continue on doing more complicated multivariate statistics. The problem of outliers and bad data can not be overemphasized.

> library(psych)
> data(sat.act)
> describe(sat.act)  # basic descriptive statistics

          vars   n   mean     sd median trimmed    mad min max range  skew kurtosis   se
gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61    -1.62 0.02
education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68    -0.07 0.05
age          3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64     2.42 0.36
ACT          4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66     0.53 0.18
SATV         5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64     0.33 4.27
SATQ         6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59    -0.02 4.41

4.4 Simple descriptive graphics

Graphic descriptions of data are very helpful both for understanding the data as well as communicating important results. Scatter Plot Matrices (SPLOMS) using the pairs.panels function are useful ways to look for strange effects involving outliers and non-linearities. error.bars.by will show group means with 95% confidence boundaries.
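As a minimal sketch of the latter (the choice of the two SAT variables and of gender as the grouping variable is just for illustration):

R code
# group means with 95% confidence limits for SATV and SATQ, split by gender
error.bars.by(sat.act[5:6], group = sat.act$gender)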

4.4.1 Scatter Plot Matrices

Scatter Plot Matrices (SPLOMS) are very useful for describing the data. The pairs.panels function, adapted from the help menu for the pairs function, produces xy scatter plots of each pair of variables below the diagonal, shows the histogram of each variable on the diagonal, and shows the lowess locally fit regression line as well. An ellipse around the mean, with the axis length reflecting one standard deviation of the x and y variables, is also drawn. The x axis in each scatter plot represents the column variable, the y axis the row variable (Figure 1). When plotting many subjects, it is both faster and cleaner to set the plot character (pch) to be '.' (see Figure 1 for an example).

pairs.panels will show the pairwise scatter plots of all the variables as well as histograms, locally smoothed regressions, and the Pearson correlation. When plotting many data points (as in the case of the sat.act data), it is possible to specify that the plot character is a period to get a somewhat cleaner graphic.

4.4.2 Correlational structure

There are many ways to display correlations. Tabular displays are probably the most common. The output from the cor function in core R is a rectangular matrix. lowerMat will round this to (2) digits and then display it as a lower off diagonal matrix. lowerCor calls cor with use='pairwise', method='pearson' as default values and returns (invisibly) the full correlation matrix and displays the lower off diagonal matrix.

> lowerCor(sat.act)

          gendr edctn age   ACT   SATV  SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64  1.00

When comparing results from two different groups, it is convenient to display them as one matrix, with the results from one group below the diagonal and the other group above the diagonal. Use lowerUpper to do this:

> female <- subset(sat.act, sat.act$gender==2)
> male <- subset(sat.act, sat.act$gender==1)
> lower <- lowerCor(male[-1])

> png('pairs.panels.png')
> pairs.panels(sat.act, pch='.')
> dev.off()
null device
          1

Figure 1: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. The plot character was set to a period (pch='.') in order to make a cleaner graph.

          edctn age   ACT   SATV  SATQ
education  1.00
age        0.61  1.00
ACT        0.16  0.15  1.00
SATV       0.02 -0.06  0.61  1.00
SATQ       0.08  0.04  0.60  0.68  1.00

> upper <- lowerCor(female[-1])

          edctn age   ACT   SATV  SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08  1.00
SATV       0.07 -0.03  0.53  1.00
SATQ       0.03 -0.09  0.58  0.63  1.00

> both <- lowerUpper(lower, upper)
> round(both, 2)

          education   age  ACT  SATV  SATQ
education        NA  0.52 0.16  0.07  0.03
age            0.61    NA 0.08 -0.03 -0.09
ACT            0.16  0.15   NA  0.53  0.58
SATV           0.02 -0.06 0.61    NA  0.63
SATQ           0.08  0.04 0.60  0.68    NA

It is also possible to compare two matrices by taking their differences and displaying one (below the diagonal) and the difference of the second from the first above the diagonal:

> diffs <- lowerUpper(lower, upper, diff=TRUE)
> round(diffs, 2)

          education   age  ACT  SATV SATQ
education        NA  0.09 0.00 -0.05 0.05
age            0.61    NA 0.07 -0.03 0.13
ACT            0.16  0.15   NA  0.08 0.02
SATV           0.02 -0.06 0.61    NA 0.05
SATQ           0.08  0.04 0.60  0.68   NA

4.4.3 Heatmap displays of correlational structure

Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set, which has a clear 3 factor solution (Figure 2), or a simulated data set of 24 variables with a circumplex structure (Figure 3). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.

> png('corplot.png')
> cor.plot(Thurstone, numbers=TRUE, main="9 cognitive variables from Thurstone")
> dev.off()
null device
          1

Figure 2: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option, the values are displayed as well.

> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> cor.plot(r.circ, main='24 variables in a circumplex')
> dev.off()
null device
          1

Figure 3: Using the cor.plot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect.

45 Polychoric tetrachoric polyserial and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient If thedata eg ability items are thought to represent an underlying continuous although latentvariable the φ will underestimate the value of the Pearson applied to these latent variablesOne solution to this problem is to use the tetrachoric correlation which is based uponthe assumption of a bivariate normal distribution that has been cut at certain points Thedrawtetra function demonstrates the process A simple generalization of this to the caseof the multiple cuts is the polychoric correlation
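A minimal sketch of this underestimation, using items simulated with sim.item and then cut at their midpoint (the seed and variable names are arbitrary):

R code
set.seed(17)
continuous <- sim.item(nvar = 6)              # 6 continuous items from psych's sim.item
dichotomous <- ifelse(continuous > 0, 1, 0)   # cut each item at 0
lowerMat(cor(dichotomous))                    # the phi coefficients
lowerMat(tetrachoric(dichotomous)$rho)        # tetrachoric estimates are noticeably larger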

Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlations.

If the data are a mix of continuous, polytomous, and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.

The correlation matrix resulting from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will also happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigen values of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
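A short sketch of the repair (assuming, as discussed above, that the burt matrix supplied with psych is not quite positive definite):

R code
data(burt)
eigen(as.matrix(burt))$values           # the smallest eigen value is slightly negative
burt.s <- cor.smooth(as.matrix(burt))   # adjust the eigen values and rescale
eigen(burt.s)$values                    # all eigen values are now positive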

5 Item and scale analysis

The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.

5.1 Dimension reduction through factor analysis and cluster analysis

Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity1. The goal for parsimony is seen in psychometrics as an attempt either to describe (components) or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

1 Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.

The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs2. At the most simple, a set of items can be thought to represent a random sample from one underlying domain or perhaps a small set of domains. The question for the psychometrician is how many domains are represented and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.

2 Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.

At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores or to estimate (for factors) factor scores.

Several of the functions in psych address the problem of data reduction.

fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis, and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will eventually be phased out.

principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values.

factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the sum of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.

vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ2 estimate.

fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.

fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

nfactors A number of different tests for the number of factors problem are run.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm, specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.

5.1.1 Minimum Residual Factor Analysis

The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the n×n correlation matrix R be approximated by the product of an n×k factor matrix F and its transpose plus a diagonal matrix of uniquenesses:

R = FF' + U²    (1)

The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls", and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.

factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.

A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 1). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ", and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation, although prior versions defaulted to varimax. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).

5.1.2 Principal Axis Factor Analysis

An alternative least squares algorithm, factor.pa (incorporated into fa as an option, fm="pa"), does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter parameter to 1 produces the SAS solution.
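A one line sketch of that comparison (max.iter is taken here to be the fa argument controlling the number of iterations; treat the exact argument name as an assumption to check against the current help page):

R code
# a single iteration principal axis solution, intended to mimic the SAS result
f3.sas <- fa(Thurstone, 3, n.obs = 213, fm = "pa", max.iter = 1)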

The solutions from the fa, the factor.minres, and factor.pa, as well as the principal functions, can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations are varimax and quartimax. Oblique transformations include oblimin, quartimin, and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.

Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 2).

5.1.3 Weighted Least Squares Factor Analysis

Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller

Table 1: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.
> f3t <- fa(Thurstone, 3, n.obs=213)
> f3t
Factor Analysis using method =  minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix

                  MR1   MR2   MR3   h2   u2 com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.0
First.Letters    0.00  0.86  0.00 0.73 0.27 1.0
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

 With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63

Table 2: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.
> f3 <- fa(Thurstone, 3, n.obs = 213, fm="pa")
> f3o <- target.rot(f3)
> f3o
Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix

                  PA1   PA2   PA3   h2   u2
Sentences        0.89 -0.03  0.07 0.81 0.19
Vocabulary       0.89  0.07  0.00 0.80 0.20
Sent.Completion  0.83  0.04  0.03 0.70 0.30
First.Letters   -0.02  0.85 -0.01 0.73 0.27
4.Letter.Words  -0.05  0.74  0.09 0.57 0.43
Suffixes         0.17  0.63 -0.09 0.43 0.57
Letter.Series   -0.06 -0.08  0.84 0.69 0.31
Pedigrees        0.33 -0.09  0.48 0.37 0.63
Letter.Group    -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00

overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 3).

The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 4). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 4) or by a structure diagram (Figure 5).

> plot(f3t)

[Figure: a SPLOM of the factor loadings, with panels labeled MR1, MR2, and MR3 and axes running from 0.0 to 0.8, under the title "Factor Analysis"]

Figure 4: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.

A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but

Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.
> f3w <- fa(Thurstone, 3, n.obs = 213, fm="wls")
> print(f3w, cut=0, digits=3)
Factor Analysis using method =  wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix

                  WLS1   WLS2   WLS3    h2    u2  com
Sentences        0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary       0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion  0.833  0.034  0.007 0.735 0.265 1.00
First.Letters   -0.002  0.855  0.003 0.731 0.269 1.00
4.Letter.Words  -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes         0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series    0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees        0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group    -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

 With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627

> fa.diagram(f3t)

[Figure: a path diagram titled "Factor Analysis" in which the nine Thurstone variables (Sentences, Vocabulary, Sent.Completion, First.Letters, 4.Letter.Words, Suffixes, Letter.Series, Letter.Group, Pedigrees) load on the correlated factors MR1, MR2, and MR3]

Figure 5: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.

that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.

5.1.4 Principal Components analysis (PCA)

An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Psychologists typically use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 4 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. (Factor loadings and PCA loadings converge as the number of items increases; factor loadings may be thought of as the asymptotic principal component loadings as the number of items on the component grows towards infinity.) Also note how the goodness of fit statistics, based upon the residual off diagonal elements, are much worse than for the fa solution.

5.1.5 Hierarchical and bi-factor solutions

For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general, factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).

Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 5) can themselves be factored. This results in a higher order factor model (Figure 6). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 7). Yet

Table 4: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 2. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.
> p3p <- principal(Thurstone, 3, n.obs = 213, rotate="Promax")
> p3p
Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix

                  RC1   RC2   RC3   h2   u2 com
Sentences        0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary       0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion  0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters    0.01  0.84  0.07 0.78 0.22 1.0
4.Letter.Words  -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes         0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series    0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees        0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group    -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

 With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
 with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98

another solution is to use structural equation modeling to directly solve for the general and group factors.

> om.h <- omega(Thurstone, n.obs=213, sl=FALSE)
> op <- par(mfrow=c(1,1))

[Figure: a hierarchical path diagram titled "Omega", in which the nine Thurstone variables load on the group factors F1, F2, and F3, which in turn load on the higher order factor g]

Figure 6: A higher order factor solution to the Thurstone 9 variable problem.

Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011). Unfortunately, this rotation will frequently achieve a local rather than global optimum, and it is recommended to try multiple original configurations.
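A minimal sketch (the choice of four factors, a general factor plus the three group factors, is one plausible configuration; rerunning from several random starting configurations is left to the reader):

R code
# an exploratory bifactor rotation of the Thurstone problem
f.bi <- fa(Thurstone, 4, n.obs = 213, rotate = "bifactor")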

> om <- omega(Thurstone, n.obs=213)

[Figure: a Schmid-Leiman path diagram titled "Omega", in which the nine Thurstone variables load both on the general factor g and on the residualized group factors F1, F2, and F3]

Figure 7: A bifactor factor solution to the Thurstone 9 variable problem.

5.1.6 Item Cluster Analysis: iclust

An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930's, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) review why the iclust algorithm is particularly appropriate for scale construction in marketing.

Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust, and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):

1. Find the proximity (e.g., correlation) matrix.

2. Identify the most similar pair of items.

3. Combine this most similar pair of items to form a new variable (cluster).

4. Find the similarity of this cluster to all other items and clusters.

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains or, in iclust, if there is a failure to increase reliability coefficients α or β).

6. Purify the solution by reassigning items to the most similar cluster center.

iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori, or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 8).

Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 7). The pattern matrix is available as an object in the results.

> data(bfi)
> ic <- iclust(bfi[1:25])

[Figure: an ICLUST dendrogram of the 25 bfi items, showing four clusters: C20 (α = 0.81, β = 0.63), C16 (α = 0.81, β = 0.76), C15 (α = 0.73, β = 0.67), and C21 (α = 0.41, β = 0.27), with the item loadings within each cluster]

Figure 8: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.

The previous analysis (Figure 8) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 9). Note that finding the polychoric correlations takes some time the first time, but the next three analyses were done using that correlation matrix (r.poly$rho). When using the console for input, polychoric will report on its progress while working using

Table 5: The summary statistics from an iclust analysis show three large clusters and a smaller cluster.
> summary(ic)  # show the results
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20    C16    C15   C21
C20  0.80 -0.291  0.40 -0.33
C16 -0.24  0.815 -0.29  0.11
C15  0.30 -0.221  0.73 -0.30
C21 -0.23  0.074 -0.20  0.61

progressBar. polychoric and tetrachoric, when running on OSX, will take advantage of the multiple cores on the machine and run somewhat faster.

Table 6: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.
> data(bfi)
> r.poly <- polychoric(bfi[1:25])  # the ... indicate the progress of the function

A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.

5.2 Confidence intervals using bootstrapping techniques

Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ2 and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 8). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.
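A compact sketch (25 iterations and the polychoric option are illustrative choices):

R code
# bootstrapped confidence intervals for a two factor solution of 10 bfi items,
# factoring the polychoric rather than the Pearson correlations
f2.boot <- fa(bfi[1:10], 2, n.iter = 25, cor = "poly")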

5.3 Comparing factor/component/cluster solutions

Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of the congruence coefficient, which is just the cosine of the angle between the dimensions:

c(f_i, f_j) = Σₖ f_ik f_jk / √( Σₖ f_ik² · Σₖ f_jk² )

Consider the case of a four factor solution and four cluster solution to the Big Five problem.

> f4 <- fa(bfi[1:25], 4, fm="pa")
> factor.congruence(f4, ic)

      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12

> ic.poly <- iclust(r.poly$rho, title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)

[Figure: an ICLUST dendrogram titled "ICLUST using polychoric correlations", in which all 25 items join a single hierarchy topped by cluster C23 (α = 0.83, β = 0.58)]

Figure 9: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 8), which was done using Pearson correlations.

> ic.poly <- iclust(r.poly$rho, 5, title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)

[Figure: an ICLUST dendrogram titled "ICLUST using polychoric correlations for nclusters=5", showing the clusters C20, C19, C16, C15, and C7, with O4 not joining any cluster]

Figure 10: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 9), which was done without specifying the number of clusters, and to the next one (Figure 11), which was done by changing the beta criterion.

> ic.poly <- iclust(r.poly$rho, beta.size=3, title="ICLUST beta.size=3")

[Figure: an ICLUST dendrogram titled "ICLUST beta.size=3", with essentially the same hierarchy as Figure 9, topped by cluster C23 (α = 0.83, β = 0.58)]

Figure 11: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 8, 9, 10).

Table 7: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.
> print(ic, cut=.3)
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
     O   P    C20   C16   C15   C21
A1 C20 C20
A2 C20 C20  0.59
A3 C20 C20  0.65
A4 C20 C20  0.43
A5 C20 C20  0.65
C1 C15 C15               0.54
C2 C15 C15               0.62
C3 C15 C15               0.54
C4 C15 C15         0.31 -0.66
C5 C15 C15 -0.30   0.36 -0.59
E1 C20 C20 -0.50
E2 C20 C20 -0.61   0.34
E3 C20 C20  0.59              -0.39
E4 C20 C20  0.66
E5 C20 C20  0.50         0.40 -0.32
N1 C16 C16         0.76
N2 C16 C16         0.75
N3 C16 C16         0.74
N4 C16 C16 -0.34   0.62
N5 C16 C16         0.55
O1 C20 C21                    -0.53
O2 C21 C21                     0.44
O3 C20 C21  0.39              -0.62
O4 C21 C21                    -0.33
O5 C21 C21                     0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit = 0.68   Pattern fit = 0.96   RMSR = 0.05
NULL

Table 8: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.
> fa(bfi[1:10], 2, n.iter=20)
Factor Analysis with confidence intervals using method =  fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method =  minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix

     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

 With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with MLE Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.078 and the 90% confidence intervals are 0.071 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

 Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.46 -0.40 -0.36
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.06 -0.03  0.00  0.74  0.77  0.80
A4  0.10  0.15  0.18  0.41  0.44  0.48
A5 -0.02  0.02  0.07  0.58  0.62  0.66
C1  0.52  0.57  0.60 -0.08 -0.05 -0.02
C2  0.59  0.62  0.66 -0.04 -0.01  0.02
C3  0.50  0.54  0.57 -0.01  0.03  0.08
C4 -0.70 -0.66 -0.62 -0.02  0.01  0.03
C5 -0.62 -0.57 -0.53 -0.08 -0.05 -0.01

Interfactor correlations and bootstrapped confidence intervals
         lower estimate upper
MR2-MR1   0.28     0.32  0.37
>

PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90

A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor, and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 9).

Table 9: Congruence coefficients for oblique factor, bifactor, and component solutions for the Thurstone problem.
> factor.congruence(list(f3t, f3o, om, p3p))

      MR1  MR2  MR3   PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2  RC3
MR1  1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
MR2  0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
MR3  0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
PA1  1.00 0.03 0.01  1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2  0.06 1.00 0.01  0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99 0.05
PA3  0.13 0.06 1.00  0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01 0.99
g    0.72 0.60 0.52  0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58 0.50
F1   1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
F2   0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
F3   0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
h2   0.74 0.57 0.51  0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56 0.49
RC1  1.00 0.04 0.06  1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06 0.00
RC2  0.08 0.99 0.02  0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00 0.05
RC3  0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05 1.00

5.4 Determining the number of dimensions to extract

How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).

Techniques most commonly used include:

1) Extracting factors until the chi square of the residual matrix is not significant.

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis) fa.parallel (Horn, 1965).

4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).

5) Extracting factors as long as they are interpretable.

6) Using the Very Simple Structure Criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.

Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples (N > 1000) all of the eigen values of random factors will be very close to 1. The scree test is quite appealing, but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigators creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.
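Several of these criteria may be examined in one pass with the nfactors function described earlier; a minimal sketch:

R code
# run a battery of number-of-factors tests on the 25 bfi items
nf <- nfactors(bfi[1:25], n = 8)   # consider solutions with up to 8 factors
nf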

An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix, after the factors of interest are extracted, is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest that nevertheless account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
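A brief sketch of that point (the seed is arbitrary, and it is assumed that, like sim.structure, sim.minor returns the simulated raw scores in its observed element when n > 0):

R code
set.seed(42)
# 12 variables, 3 major factors, plus small nuisance factors
sim.data <- sim.minor(nvar = 12, nfact = 3, n = 500)
fa.parallel(sim.data$observed)   # do the minor factors inflate the suggested number?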

5.4.1 Very Simple Structure

The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Absolute Partial correlation) of Velicer (1976).

Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 12). However, the MAP criterion suggests that 5 is optimal.

> vss <- vss(bfi[1:25], title="Very Simple Structure of a Big 5 inventory")

[Figure: a plot titled "Very Simple Structure of a Big 5 inventory" showing Very Simple Structure Fit (0.0 to 1.0) against the number of factors (1 to 8), with separate curves for item complexities 1 through 4]

Figure 12: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.

> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximimum of 0.58  with  4  factors
VSS complexity 2 achieves a maximimum of 0.74  with  4  factors

The Velicer MAP achieves a minimum of 0.01  with  5  factors
BIC achieves a minimum of  -524.26  with  8  factors
Sample Size adjusted BIC achieves a minimum of  -117.56  with  8  factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit RMSEA  BIC   SABIC complex eChisq  SRMR eCRMS  eBIC
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.123 9648 10522.1     1.0  23881 0.119 0.125 21698
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.100 5287  6084.5     1.2  12432 0.086 0.094 10440
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.087 3200  3924.3     1.3   7232 0.066 0.075  5422
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.074 1731  2385.1     1.4   3750 0.047 0.057  2115
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.055  281   869.3     1.6   1495 0.030 0.038    27
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.043 -296   228.5     1.7    670 0.020 0.027  -639
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.037 -463     1.2     1.9    448 0.016 0.023  -711
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.032 -524  -117.6     1.9    289 0.013 0.020  -727

5.4.2 Parallel Analysis

An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 13). It is interesting to compare fa.parallel with the paran from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial minor (but not zero) factors, show that using smcs will tend to lead to too many factors.

A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly or fa.parallel with the cor option="poly". By default, the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
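A one line sketch of the cor option form (with the default 20 replications this can take a while on the full bfi):

R code
# parallel analysis of the bfi items treated as polytomous
fa.parallel(bfi[1:25], cor = "poly", n.iter = 20)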

5.5 Factor extension

Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 14).

> fa.parallel(bfi[1:25], main="Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors = 6 and the number of components = 6

[Figure: a scree plot titled "Parallel Analysis of a Big 5 inventory" showing the eigen values of principal components and factor analysis (0 to 5) against the factor/component number (1 to 25), with curves for PC Actual Data, PC Simulated Data, PC Resampled Data, FA Actual Data, FA Simulated Data, and FA Resampled Data]

Figure 13: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.

Another way to examine the overlap between two sets is the use of set correlation found by setCor (discussed later).

6 Classical Test Theory and Reliability

Surprisingly, 112 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).

α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

$$\lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(\vec{V})_x}{V_x}\right) = \frac{n}{n-1}\,\frac{V_x - tr(\vec{V}_x)}{V_x} = \alpha$$

Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. For τ equivalent items, alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy") coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.

Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, $e_j^2$, and is

$$\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum(1 - r_{smc}^2)}{V_x}$$

The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.
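A minimal sketch of these lower bound estimates, assuming the first five bfi (Agreeableness) items as input; smc finds the squared multiple correlation of each item with all the others:

> round(smc(bfi[1:5]), 2)   #smc of each item with the other four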

G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.


> v16 <- sim.item(16)
> s <- c(1,3,5,7,9,11,13,15)
> f2 <- fa(v16[,s],2)
> fe <- fa.extension(cor(v16)[s,-s],f2)
> fa.diagram(f2,fe=fe)

[Figure: path diagram titled "Factor analysis and extension" — factors MR1 and MR2, defined by the odd-numbered variables on the left, are extended to the even-numbered variables on the right.]

Figure 14: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.



Alpha, G6 and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.

More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.

Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.

Functions for examining the reliability of a single scale or a set of scales include

alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item - whole correlations, α if an item is omitted, and item means and standard deviations.

guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zegers (1978) ($\mu_0 \ldots \mu_3$), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).

omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)

cluster.cor Given a n x c cluster definition matrix of -1s, 0s, and 1s (the keys), and a n x n correlation matrix, find the correlations of the composite clusters.

scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of items scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha coefficient, G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.

score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.

6.1 Reliability of a single scale

A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, and McDonald's ωh and ωt. Consider a simulated data set, representing 9 items with a hierarchical structure and the following correlation matrix. Then using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as the estimates if one item is dropped at a time.

> set.seed(17)
> r9 <- sim.hierarchical(n=500,raw=TRUE)$observed
> round(cor(r9),2)

     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00

> alpha(r9)

Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
0.77 0.8 0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97

Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that item 2 is to be scored (incorrectly) negatively.

> keys <- c(1,-1,1,1,1,1,1)
> alpha(attitude,keys)

Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
       0.43      0.52    0.75      0.14 1.1 0.13   58 5.5

 lower alpha upper     95% confidence boundaries
0.18 0.43 0.68

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r  S/N alpha se
rating           0.32      0.44    0.67     0.114 0.77    0.170
complaints-      0.80      0.80    0.82     0.394 3.89    0.057
privileges       0.27      0.41    0.72     0.103 0.69    0.176
learning         0.14      0.31    0.64     0.069 0.44    0.207
raises           0.14      0.27    0.61     0.059 0.38    0.205
critical         0.36      0.47    0.76     0.130 0.90    0.139
advance          0.21      0.34    0.66     0.079 0.51    0.171

 Item statistics
             n raw.r std.r r.cor r.drop mean   sd
rating      30  0.60  0.60  0.60   0.33   65 12.2
complaints- 30 -0.57 -0.58 -0.74  -0.74   50 13.3
privileges  30  0.66  0.65  0.54   0.42   53 12.2
learning    30  0.80  0.79  0.78   0.64   56 11.7
raises      30  0.82  0.83  0.85   0.69   65 10.4
critical    30  0.50  0.53  0.35   0.27   75  9.9
advance     30  0.74  0.75  0.71   0.58   43 10.3

Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if item 2 is dropped then the reliability is improved substantially. This suggests that item 2 was incorrectly scored. Doing the analysis again with item 2 positively scored produces much more favorable results.

> keys <- c(1,1,1,1,1,1,1)
> alpha(attitude,keys)


Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
0.76 0.84 0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3

It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.

> items <- sim.congeneric(N=500,short=FALSE,low=-2,high=2,categorical=TRUE) #500 responses to 4 discrete items
> alpha(items$observed)  #item response analysis of congeneric measures

Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
0.68 0.72 0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02


Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0

6.2 Using omega to find the reliability of a single scale

Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 15) or omegaSem for a confirmatory analysis using the sem package (Fox et al., 2013) based upon the exploratory solution from omega.

McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) (http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf) compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009), http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf.)

One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.

ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal). For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively, and some negatively. Although probably better to specify which directions the items are to be scored by specifying a key vector, if flip=TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)
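A minimal sketch of how the choice of factoring method changes ωh, reusing the simulated r9 data from the reliability example in the previous section (the omega_h element of the returned object is assumed to hold the estimate):

> #compare omega_h across extraction methods (a sketch)
> om.mr <- omega(r9, plot=FALSE)              #the default: minres
> om.pa <- omega(r9, fm="pa", plot=FALSE)     #principal axes
> om.pc <- omega(r9, fm="pc", plot=FALSE)     #principal components
> round(c(minres=om.mr$omega_h, pa=om.pa$omega_h, pc=om.pc$omega_h), 2)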

β, an alternative to ω, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).

The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).

Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.

There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be the $\sqrt{r_{ab}}$ where $r_{ab}$ is the oblique correlation between the factors), or to "first" or "second" in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. (2007). To do this in omega, add the option="first" or option="second" to the call.
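A hedged sketch of the two factor case, again using the simulated r9 data; only the option argument differs between the calls:

> om2       <- omega(r9, nfactors=2)                   #default option="equal"
> om2.first <- omega(r9, nfactors=2, option="first")   #equate g with the first group factor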

Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.

In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.

McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness, $u^2$, from factor analysis to find $e_j^2$. This is based on a decomposition of the variance of a test score, $V_x$, into four parts: that due to a general factor, g, that due to a set of group factors, f, (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)

Letting $x = cg + Af + Ds + e$, then the communality of item $j$, based upon general as well as group factors, is $h_j^2 = c_j^2 + \sum f_{ij}^2$ and the unique variance for the item, $u_j^2 = \sigma_j^2(1 - h_j^2)$, may be used to estimate the test reliability. That is, if $h_j^2$ is the communality of item $j$, based upon general as well as group factors, then for standardized items, $e_j^2 = 1 - h_j^2$ and

$$\omega_t = \frac{\mathbf{1cc'1} + \mathbf{1AA'1'}}{V_x} = 1 - \frac{\sum(1-h_j^2)}{V_x} = 1 - \frac{\sum u^2}{V_x}$$


Because $h_j^2 \geq r_{smc}^2$, $\omega_t \geq \lambda_6$.

It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor.

$$\omega_h = \frac{\mathbf{1cc'1}}{V_x}$$

Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

$$\omega_{\infty} = \frac{\mathbf{1cc'1}}{\mathbf{1cc'1} + \mathbf{1AA'1'}}$$

In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.

> om9

9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76    max/min =  1.46
mean percent general = 0.64    with sd = 0.08 and cv of 0.13
Explained Common Variance of the general factor = 0.65

The degrees of freedom are 12  and the fit is 0.03
The number of observations was 500  with Chi Square = 15.86  with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03


> om9 <- omega(r9,title="9 simulated variables")

[Figure: omega diagram titled "9 simulated variables" — a general factor g and three group factors F1, F2, F3, each defined by three of the nine variables.]

Figure 15: A bifactor solution for 9 simulated variables with a hierarchical structure.


RMSEA index = 0.026  and the 90 % confidence intervals are NA 0.055
BIC = -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is 0.32
The number of observations was 500  with Chi Square = 159.82  with prob < 8.3e-21
The root mean square of the residuals is 0.08
The df corrected root mean square of the residuals is 0.09

RMSEA index = 0.1  and the 90 % confidence intervals are 0.085 0.114
BIC = -7.97

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

6.3 Estimating ωh using Confirmatory Factor Analysis

The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.

> omegaSem(r9,n.obs=500)

Call: omegaSem(m = r9, n.obs = 500)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)

Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63


With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76    max/min =  1.46
mean percent general = 0.64    with sd = 0.08 and cv of 0.13
Explained Common Variance of the general factor = 0.65

The degrees of freedom are 12  and the fit is 0.03
The number of observations was 500  with Chi Square = 15.86  with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03
RMSEA index = 0.026  and the 90 % confidence intervals are NA 0.055
BIC = -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is 0.32
The number of observations was 500  with Chi Square = 159.82  with prob < 8.3e-21
The root mean square of the residuals is 0.08
The df corrected root mean square of the residuals is 0.09

RMSEA index = 0.1  and the 90 % confidence intervals are 0.085 0.114
BIC = -7.97

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

Omega Hierarchical from a confirmatory model using sem = 0.72
Omega Total from a confirmatory model using sem = 0.83
With loadings of
      g  F1*  F2*  F3*   h2   u2
V1 0.74 0.40           0.71 0.29
V2 0.58 0.36           0.47 0.53
V3 0.56 0.42           0.50 0.50
V4 0.57      0.45      0.53 0.47
V5 0.58      0.25      0.40 0.60
V6 0.43      0.26      0.26 0.74
V7 0.49           0.38 0.38 0.62
V8 0.44           0.45 0.39 0.61
V9 0.34           0.24 0.17 0.83

With eigenvalues of:
   g  F1*  F2*  F3*
2.60 0.47 0.40 0.33


6.3.1 Other estimates of reliability

Other estimates of reliability are found by the splitHalf function. These are described in more detail in Revelle and Zinbarg (2009). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.

> splitHalf(r9)

Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) =  0.84
Guttman lambda 6                          =  0.8
Average split half reliability            =  0.79
Guttman lambda 3 (alpha)                  =  0.8
Minimum split half reliability (beta)     =  0.72

6.4 Reliability and correlations of multiple scales within an inventory

A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness, and Openness). The data may either be the raw data or a correlation matrix (scoreItems) or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6*.

6.4.1 Scoring from raw data

To score these five scales from the 25 items, use the scoreItems function with the helper function make.keys. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.

There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.
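A minimal sketch of the totals option (assuming the keys matrix created below):

> #total scores rather than item means
> scores.total <- scoreItems(keys, bfi, totals=TRUE)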

The printed output includes coefficients α and G6, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.

First, create the keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copying it into R.) Although not normally necessary, show the keys to understand what is happening.

Note that the number of items to specify in the make.keys function is the total number of items in the inventory. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.

> keys <- make.keys(nvars=28,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+    Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+    Openness = c(21,-22,23,24,-25)),
+    item.labels=colnames(bfi))
> keys

          Agree Conscientious Extraversion Neuroticism Openness
A1           -1             0            0           0        0
A2            1             0            0           0        0
A3            1             0            0           0        0
A4            1             0            0           0        0
A5            1             0            0           0        0
C1            0             1            0           0        0
C2            0             1            0           0        0
C3            0             1            0           0        0
C4            0            -1            0           0        0
C5            0            -1            0           0        0
E1            0             0           -1           0        0
E2            0             0           -1           0        0
E3            0             0            1           0        0
E4            0             0            1           0        0
E5            0             0            1           0        0
N1            0             0            0           1        0
N2            0             0            0           1        0
N3            0             0            0           1        0
N4            0             0            0           1        0
N5            0             0            0           1        0
O1            0             0            0           0        1
O2            0             0            0           0       -1
O3            0             0            0           0        1
O4            0             0            0           0        1
O5            0             0            0           0       -1
gender        0             0            0           0        0
education     0             0            0           0        0
age           0             0            0           0        0

The use of multiple key matrices for different inventories is facilitated by using the superMatrix function to combine two or more matrices. This allows convenient scoring of large data sets combining multiple inventories with keys based upon each individual inventory. Pretend for the moment that the big 5 items were made up of two inventories, one consisting of the first 10 items, the second the last 18 items (15 personality items + 3 demographic items). Then the following code would work:

> keys.1 <- make.keys(10,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10)))
> keys.2 <- make.keys(15,list(Extraversion=c(-1,-2,3:5),Neuroticism=c(6:10),
+    Openness = c(11,-12,13,14,-15)))
> keys.25 <- superMatrix(list(keys.1,keys.2))

The resulting keys matrix is identical to that found above except that it does not include the extra 3 demographic items. This is useful when scoring the raw items, because the response frequencies for each category are reported, and for the demographic data.

This use of making multiple key matrices and then combining them into one super matrix of keys is particularly useful when combining demographic information with items to be scored. A set of demographic keys can be made and then these can be combined with the keys for the particular scales.

Now use these keys in combination with the raw data to score the items, calculate basic reliability and intercorrelations, and find the item-by-scale correlations for each item and each scale. By default, missing data are replaced by the median for that variable.

> scores <- scoreItems(keys,bfi)
> scores

Call: scoreItems(keys = keys, items = bfi)

(Unstandardized) Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
alpha   0.7          0.72         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
    Agree Conscientious Extraversion Neuroticism Openness
ASE 0.014         0.014        0.013       0.011    0.017

Average item correlation:
          Agree Conscientious Extraversion Neuroticism Openness
average.r  0.32          0.34         0.39        0.46     0.23

Guttman 6 reliability:
         Agree Conscientious Extraversion Neuroticism Openness
Lambda.6   0.7          0.72         0.76        0.81      0.6

Signal/Noise based upon av.r:
             Agree Conscientious Extraversion Neuroticism Openness
Signal/Noise   2.3           2.6          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, alpha on the diagonal
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.70          0.36         0.63      -0.245     0.23
Conscientious  0.26          0.72         0.35      -0.305     0.30
Extraversion   0.46          0.26         0.76      -0.284     0.32
Neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
Openness       0.15          0.19         0.22      -0.086     0.60

In order to see the item by scale loadings and frequency counts of the data, print with the short option = FALSE.

To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores. (See Figure 16.)

6.4.2 Forming scales from a correlation matrix

There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.

Consider the same bfi data set, but first find the correlations, and then use cluster.cor.

> r.bfi <- cor(bfi,use="pairwise")
> scales <- cluster.cor(keys,r.bfi)
> summary(scales)

Call: cluster.cor(keys = keys, r.mat = r.bfi)

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, (standardized) alpha on the diagonal
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.71          0.35         0.64      -0.242     0.25
Conscientious  0.25          0.73         0.36      -0.287     0.30
Extraversion   0.47          0.27         0.76      -0.278     0.35
Neuroticism   -0.18         -0.22        -0.22       0.815    -0.11
Openness       0.16          0.20         0.24      -0.074     0.61

To find the correlations of the items with each of the scales (the "structure" matrix) or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.


> png('scores.png')
> pairs.panels(scores$scores,pch='.',jiggle=TRUE)
> dev.off()
quartz
     2

Figure 16: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.


6.5 Scoring Multiple Choice Items

Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).

> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys,iqitems)

Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

item statistics
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39

> #just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys,iqitems,score=FALSE)
> describe(iq.tf)  #compare to previous results

          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01

Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
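For example, a hedged sketch in which the 16 items are keyed into four scales by their content names; this grouping is merely illustrative:

> #an illustrative keying of the 16 items into four content scales
> iq.list <- list(reasoning=1:4, letters=5:8, matrices=9:12, rotation=13:16)
> iq.scale.keys <- make.keys(16, iq.list, item.labels=colnames(iq.tf))
> iq.scales <- scoreItems(iq.scale.keys, iq.tf)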

6.6 Item analysis

Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or based upon correlation matrices using the setCor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.

Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 17).

7 Item Response Theory analysis

The use of Item Response Theory is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 18).


> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys,iqitems,score=TRUE,short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  #set this to see the output for multiple items
> irt.responses(scores$scores,iqitems[1:4],breaks=11)

[Figure: four panels (reason.4, reason.16, reason.17, reason.19) showing P(response) for each response alternative (0-6) as a function of theta.]

Figure 17: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.


7.1 Factor analysis and Item Response Theory

If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δj, and item discrimination, αj, may be found from factor analysis of the tetrachoric correlations where λj is just the factor loading on the first factor and τj is the normal threshold reported by the tetrachoric function.

$$\delta_j = \frac{D\tau}{\sqrt{1-\lambda_j^2}}, \qquad \alpha_j = \frac{\lambda_j}{\sqrt{1-\lambda_j^2}} \qquad (2)$$

where D is a scaling factor used when converting to the parameterization of the logistic model and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λj) and item thresholds (τ) are just

$$\lambda_j = \frac{\alpha_j}{\sqrt{1+\alpha_j^2}}, \qquad \tau_j = \frac{\delta_j}{\sqrt{1+\alpha_j^2}}.$$

Consider 9 dichotomous items representing one factor but differing in their levels of difficulty.

> set.seed(17)
> d9 <- sim.irt(9,1000,-2.5,2.5,mod="normal") #dichotomous items
> test <- irt.fa(d9$items)

> test

Item Response Analysis using Factor Analysis

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor = 1
              -3    -2    -1     0    1    2     3
V1          0.48  0.59  0.24  0.06 0.01 0.00  0.00
V2          0.30  0.68  0.45  0.12 0.02 0.00  0.00
V3          0.12  0.50  0.74  0.29 0.06 0.01  0.00
V4          0.05  0.26  0.74  0.55 0.14 0.03  0.00
V5          0.01  0.07  0.44  1.05 0.41 0.06  0.01
V6          0.00  0.03  0.15  0.60 0.74 0.24  0.04
V7          0.00  0.01  0.04  0.22 0.73 0.63  0.16
V8          0.00  0.00  0.02  0.12 0.45 0.69  0.31
V9          0.00  0.01  0.02  0.08 0.25 0.47  0.36
Test Info   0.98  2.14  2.85  3.09 2.81 2.14  0.89
SEM         1.01  0.68  0.59  0.57 0.60 0.68  1.06
Reliability -0.02 0.53  0.65  0.68 0.64 0.53 -0.12


Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27  and the objective function was 1.2
The number of observations was 1000  with Chi Square = 1195.58  with prob < 7.3e-235

The root mean square of the residuals (RMSA) is 0.09
The df corrected root mean square of the residuals is 0.1

Tucker Lewis Index of factoring reliability = 0.699
RMSEA index = 0.209  and the 90 % confidence intervals are 0.198 0.218
BIC = 1009.07

Similar analyses can be done for polytomous items, such as those of the bfi extraversion scale:

> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor = 1
              -3    -2    -1    0    1    2    3
E1          0.37  0.58  0.75 0.76 0.61 0.39 0.21
E2          0.48  0.89  1.18 1.24 0.99 0.59 0.25
E3          0.34  0.49  0.58 0.57 0.49 0.36 0.23
E4          0.63  0.98  1.11 0.96 0.67 0.37 0.16
E5          0.33  0.42  0.47 0.45 0.37 0.27 0.17
Test Info   2.15  3.36  4.09 3.98 3.13 1.97 1.02
SEM         0.68  0.55  0.49 0.50 0.56 0.71 0.99
Reliability 0.53  0.70  0.76 0.75 0.68 0.49 0.02

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5  and the objective function was 0.05
The number of observations was 2800  with Chi Square = 135.92  with prob < 1.3e-27

The root mean square of the residuals (RMSA) is 0.04
The df corrected root mean square of the residuals is 0.06

Tucker Lewis Index of factoring reliability = 0.932
RMSEA index = 0.097  and the 90 % confidence intervals are 0.083 0.111
BIC = 96.23

The item information functions show that not all of the items are equally good (Figure 19):

These procedures can be generalized to more than one factor by specifying the number of factors in irt.fa.

> op <- par(mfrow=c(3,1))
> plot(test,type="ICC")
> plot(test,type="IIC")
> plot(test,type="test")
> op <- par(mfrow=c(1,1))

[Figure: three panels plotted against the Latent Trait (normal scale) — "Item parameters from factor analysis" (probability of response for V1-V9), "Item information from factor analysis", and "Test information -- item parameters from factor analysis" (with reliability on the right axis).]

Figure 18: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.


> e.info <- plot(e.irt,type="IIC")

[Figure: "Item information from factor analysis for factor 1" — item information curves for -E1, -E2, E3, E4, and E5 against the Latent Trait (normal scale).]

Figure 19: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.


The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.

> print(e.info,sort=TRUE)

Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor = 1
              -3    -2    -1    0    1    2    3
E2          0.48  0.89  1.18 1.24 0.99 0.59 0.25
E4          0.63  0.98  1.11 0.96 0.67 0.37 0.16
E1          0.37  0.58  0.75 0.76 0.61 0.39 0.21
E3          0.34  0.49  0.58 0.57 0.49 0.36 0.23
E5          0.33  0.42  0.47 0.45 0.37 0.27 0.17
Test Info   2.15  3.36  4.09 3.98 3.13 1.97 1.02
SEM         0.68  0.55  0.49 0.50 0.56 0.71 0.99
Reliability 0.53  0.70  0.76 0.75 0.68 0.49 0.02

More extensive IRT packages include ltm and eRm, and these should be used for serious Item Response Theory analysis.

7.2 Speeding up analyses

Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.

For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 20) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items.

> iq.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = iq.tf)
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor = 1
            -3    -2    -1    0    1    2    3
reason.4  0.04  0.20  0.59 0.59 0.19 0.04 0.01
reason.16 0.07  0.23  0.46 0.39 0.16 0.05 0.01
reason.17 0.06  0.27  0.69 0.50 0.14 0.03 0.01
reason.19 0.05  0.17  0.41 0.46 0.22 0.07 0.02


> iq.irt <- irt.fa(iq.tf)

[Figure: "Item information from factor analysis" — item information curves for the 16 ability items (reason, letter, matrix, and rotate items) against the Latent Trait (normal scale).]

Figure 20: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.


letter.7    0.04  0.16  0.44 0.52 0.23 0.06 0.02
letter.33   0.04  0.14  0.34 0.43 0.24 0.08 0.02
letter.34   0.04  0.17  0.48 0.54 0.22 0.06 0.01
letter.58   0.02  0.08  0.28 0.55 0.39 0.13 0.03
matrix.45   0.05  0.11  0.23 0.29 0.21 0.10 0.04
matrix.46   0.05  0.12  0.25 0.30 0.21 0.10 0.04
matrix.47   0.05  0.16  0.37 0.41 0.21 0.07 0.02
matrix.55   0.04  0.08  0.14 0.20 0.19 0.13 0.06
rotate.3    0.00  0.01  0.07 0.29 0.66 0.45 0.13
rotate.4    0.00  0.01  0.05 0.30 0.91 0.55 0.11
rotate.6    0.01  0.03  0.14 0.49 0.68 0.27 0.06
rotate.8    0.00  0.02  0.08 0.27 0.54 0.39 0.13
Test Info   0.55  1.95  4.99 6.53 5.41 2.57 0.71
SEM         1.34  0.72  0.45 0.39 0.43 0.62 1.18
Reliability -0.80 0.49  0.80 0.85 0.82 0.61 -0.40

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104  and the objective function was 1.85
The number of observations was 1525  with Chi Square = 2801.05  with prob < 0

The root mean square of the residuals (RMSA) is 0.08
The df corrected root mean square of the residuals is 0.09

Tucker Lewis Index of factoring reliability = 0.742
RMSEA index = 0.131  and the 90 % confidence intervals are 0.126 0.135
BIC = 2038.76

7.3 IRT based scoring

The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for score.irt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items per person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The score.irt function applies basic IRT to this problem.

Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects.

> v9 <- sim.irt(9,1000,-2,2,mod="normal") #dichotomous items
> items <- v9$items
> test <- irt.fa(items)


> om <- omega(iq.irt$rho,4)

[Figure: omega diagram for the 16 ability items — a general factor g and four group factors F1 (rotate items), F2 (letter items), F3 (reason and matrix items), and F4, with the item loadings shown on the paths.]

Figure 21: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when we examine the item response functions.


> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> #now delete some of the data - note that they are ordered by score
> items[1:333,5:9] <- NA
> items[334:666,3:7] <- NA
> items[667:1000,1:4] <- NA
> scores <- score.irt(test,items)
> unitweighted <- score.irt(items=items,keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord],scores,unitweighted)
> colnames(scores.df) <- c("True theta","irt theta","total","fit","rasch","total","fit")

These results are seen in Figure 22.

8 Multilevel modeling

Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (over all descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.

8.1 Decomposing data into within and between level correlations using statsBy

There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.

This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:

$$r_{xy} = \eta_{x_{wg}} \ast \eta_{y_{wg}} \ast r_{xy_{wg}} + \eta_{x_{bg}} \ast \eta_{y_{bg}} \ast r_{xy_{bg}} \qquad (3)$$


Figure 22: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.
> pairs.panels(scores.df,pch='.',gap=0)
> title('Comparing true theta for IRT, Rasch and classically based scoring',line=3)

[Figure: pairs.panels scatter plot matrix comparing true theta with the IRT, total, and Rasch based scores (and their fits) for the 1000 simulated subjects.]


where $r_{xy}$ is the normal correlation, which may be decomposed into a within group and between group correlations, $r_{xy_{wg}}$ and $r_{xy_{bg}}$, and η (eta) is the correlation of the data with the within group values or the group means.

8.2 Generating and displaying multilevel data

withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5, and V8, and V3, V6, and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6, and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5, and V6, and V7, V8, and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.

sim.multilevel will generate simulated data with a multilevel structure.

The statsBy.boot function will randomize the grouping variable ntrials times, and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.

Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)) or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
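A minimal sketch of the first case; rwg and rbg are assumed to be the elements of the statsBy output that hold the pooled within group and between group correlations:

> sb <- statsBy(sat.act, group="education")
> round(sb$rwg, 2)    #pooled within group correlations
> round(sb$rbg, 2)    #correlations of the group means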

9 Set Correlation and Multiple Regression from the correlation matrix

An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is

$$R^2 = 1 - \prod_{i=1}^{n}(1-\lambda_i)$$


where $\lambda_i$ is the ith eigen value of the eigen value decomposition of the matrix

$$R = R_{xx}^{-1}R_{xy}R_{yy}^{-1}R_{yx}.$$

Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic, based upon the average canonical correlation, might be more appropriate.

setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.

Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example, the analysis is done on the correlation matrix rather than the raw data.

> C <- cov(sat.act,use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)

Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared: 0.0272,	Adjusted R-squared: 0.02301
F-statistic: 6.487 on 3 and 696 DF,  p-value: 0.0002476

Compare this with the output from setCor.

> #compare with mat.regress
> setCor(c(4:6),c(1:3),C, n.obs=700)


Call: setCor(y = y, x = x, data = data, z = z, n.obs = n.obs, use = use,
    std = std, square = square, main = main, plot = plot)

Multiple Regression from matrix input

Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19

multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11

Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05

degrees of freedom of regression
[1]   3 696

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.050 0.033 0.008
Chisq of canonical correlations
[1] 35.8 23.1  5.6

Average squared canonical correlation = 0.03
Cohens Set Correlation R2 = 0.09
Shrunken Set Correlation R2 = 0.08
F and df of Cohens Set Correlation  7.26 9 1681.86
Unweighted correlation between the two sets = 0.01

Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same independent of the direction of the relationship.
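A quick sketch of this symmetry, simply reversing the roles of the two sets from the example above:

> #reversing predictor and criterion sets: the set R2 is unchanged
> setCor(c(1:3), c(4:6), C, n.obs=700)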

10 Simulation functions

It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.

A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex, and sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item, a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated Rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl, 1 to 4 parameter logistic IRT, or sim.npn, 1 to 4 parameter normal IRT, sim.structural, a general simulation of structural models, and sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.

sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.simple Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.


sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.

Some of these functions are described in more detail in the companion vignette, psych for sem.

The default for sim.structure is to generate a 4 factor, 12 variable data set with a simplex structure between the factors.

Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.

Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures (see the sketch below). The structures generated can be thought of as having a major factor structure with some small correlated residuals.
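
A minimal sketch of such an exploration (the sample size and seed here are illustrative, not defaults):

R code
library(psych)
set.seed(42)
sm <- sim.minor(nvar=12,nfact=3,n=500)  #3 major factors, 6 minor factors by default
fa.parallel(sm$observed)                #can the 3 major factors still be identified?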

Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.

The four IRT simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.

The logistic model is

P(x|θi, δj, γj, ζj) = γj + (ζj − γj) / (1 + e^(αj(δj − θi)))     (4)

where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), αj is item discrimination and δj is item difficulty. For the 1 Parameter Logistic (Rasch) model, γ = 0, ζ = 1, α = 1, and item difficulty is the only free parameter to specify.
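
Equation 4 is easy to compute directly. The following sketch (not a psych function; the parameter values are made up for illustration) plots an item characteristic curve:

R code
irt.4pl <- function(theta,delta=0,alpha=1,gamma=0,zeta=1) {
   gamma + (zeta - gamma)/(1 + exp(alpha*(delta - theta)))   #Equation 4
}
curve(irt.4pl(x,delta=0,alpha=1.5,gamma=.2),-4,4,
   xlab="theta",ylab="P(correct)")   #a 3PL item with guessing parameter .2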

(Graphics of these may be seen in the demonstrations for the logistic function)

The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same. With the a = α parameter = 1.702 in the logistic model, the two models are practically identical.
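
This near equivalence is easy to verify numerically using nothing beyond base R:

R code
theta <- seq(-4,4,.01)
p.logistic <- 1/(1 + exp(-1.702 * theta))   #logistic with alpha = 1.702
p.normal <- pnorm(theta)                    #the normal ogive
max(abs(p.logistic - p.normal))             #less than .01 everywhere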

11 Graphical Displays

Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as specific diagram functions, e.g. (but not shown):

f3 <- fa(Thurstone,3)
plot(f3)
fa.diagram(f3)
c <- iclust(Thurstone)
plot(c)            #a pretty boring plot
iclust.diagram(c)  #a better diagram
c3 <- iclust(Thurstone,3)
plot(c3)           #a more interesting plot
data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)
ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)

The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.


> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA,xlim=xlim,ylim=ylim,main="Demonstration of dia functions",axes=FALSE,xlab="",ylab="")
> ul <- dia.rect(1,9,labels="upper left",xlim=xlim,ylim=ylim)
> ll <- dia.rect(1,3,labels="lower left",xlim=xlim,ylim=ylim)
> lr <- dia.ellipse(9,3,"lower right",xlim=xlim,ylim=ylim)
> ur <- dia.ellipse(9,9,"upper right",xlim=xlim,ylim=ylim)
> ml <- dia.ellipse(3,6,"middle left",xlim=xlim,ylim=ylim)
> mr <- dia.ellipse(7,6,"middle right",xlim=xlim,ylim=ylim)
> bl <- dia.ellipse(1,1,"bottom left",xlim=xlim,ylim=ylim)
> br <- dia.rect(9,1,"bottom right",xlim=xlim,ylim=ylim)
> dia.arrow(from=lr,to=ul,labels="right to left")
> dia.arrow(from=ul,to=ur,labels="left to right")
> dia.curved.arrow(from=lr,to=ll$right,labels ="right to left")
> dia.curved.arrow(to=ur,from=ul$right,labels ="left to right")
> dia.curve(ll$top,ul$bottom,"double")  #for rectangles, specify where to point
> dia.curved.arrow(mr,ur,"up")          #but for ellipses, just point to it
> dia.curve(ml,mr,"across")
> dia.arrow(ur,lr,"top down")
> dia.curved.arrow(br$top,lr$bottom,"up")
> dia.curved.arrow(bl,br,"left to right")
> dia.arrow(bl,ll$bottom)
> dia.curved.arrow(ml,ll$right)
> dia.curved.arrow(mr,lr$top)


Figure 23: The basic graphic capabilities of the dia functions are shown in this figure.

12 Miscellaneous functions

A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.

block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score.

geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headtail: combines the head and tail functions to show the first and last lines of a data set or output, but does not add an ellipsis between them.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r, and estimates effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also setCor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.


superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys, or when forming example problems.
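
A few of these helpers in action; a minimal sketch (the simulated responses are made up for illustration):

R code
library(psych)
set.seed(17)
x <- matrix(sample(1:6,100*4,replace=TRUE),ncol=4)  #fake 1-6 Likert responses
x.rev <- reverse.code(keys=c(1,-1,1,-1),items=x,mini=1,maxi=6)  #reverse items 2 and 4
fisherz(.30)     #convert r = .30 to a Fisher z
headtail(x.rev)  #first and last lines of the recoded data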

13 Data sets

A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941) and were reanalyzed by Bechtoldt (1961). Personality item data represent five personality factors on 25 items (bfi) or 13 personality inventory scores (epi.bfi), and there are 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.

Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.


iqitems 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton height; peas is the data set Francis Galton used to introduce the correlation coefficient, with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1994), this data set allows for examples of basic scaling techniques.

14 Development version and a users guide

The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered, ones). To download that version, go to the repository http://personality-project.org/r/src/contrib and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.

News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news,

> news(Version > "1.2.8",package="psych")


15 Psychometric Theory

The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book): An introduction to Psychometric Theory with Applications in R (Revelle, in prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular, and R in general, consult http://personality-project.org/r/r.guide.html, A short guide to R.

16 SessionInfo

This document was prepared using the following settings:

> sessionInfo()

R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] psych_1.6.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4           lattice_0.20-33       matrixcalc_1.0-3      MASS_7.3-45           grid_3.3.0            arm_1.8-6
 [7] nlme_3.1-127          stats4_3.3.0          coda_0.18-1           minqa_1.2.4           mi_1.0                nloptr_1.0.4
[13] Matrix_1.2-6          boot_1.3-18           splines_3.3.0         lme4_1.1-12           sem_3.1-7             tools_3.3.0
[19] abind_1.4-3           GPArotation_2014.11-1 parallel_3.3.0        mnormt_1.5-4


References

Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405–432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439–458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447–473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel Modeling in R (2.3): A Brief Introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245–276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78–98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297–334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173–178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England. 122 pp.

Fox, J., Nie, Z., and Byrnes, J. (2013). sem: Structural Equation Models. R package version 3.1-3.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430–450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.


Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255–282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121–132.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41–54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283–300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4):537–549.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231–258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309–317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153–175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676–1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481–495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill, New York, 3rd edition.


Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57–74.

Revelle, W. (2017). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. http://cran.r-project.org/web/packages/psych. R package version 1.7.3.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39–73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403–414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Individual differences in cognition: New methods for examining the personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27–49. Springer, New York, NY.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega and the glb: comments on Sijtsma. Psychometrika, 74(1):145–154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83–90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. A Series of books in biology. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. A Series of books in biology. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345–353.


Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components–an alternative to "mathematical factors." Psychological Review, 42(5):425–454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321–327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123–133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121–144.



assumed to have a header row of variable labels (header=TRUE). If the data do not have a header row, you must specify read.file(header=FALSE).

To read SPSS files and to keep the value labels, specify use.value.labels=TRUE. Unfortunately, this will make it difficult to do any analyses, and you might want to read it again, converting the value labels to numeric values.

R code
my.spss <- read.file(use.value.labels=TRUE)  #this will keep the value labels for sav files
my.data <- read.file()  #just read the file and convert the value labels to numeric

You can quickView the my.spss object to see what it looks like, as well as the my.data object to see the numeric encoding.

R code
quickView(my.spss)  #to see the original encoding
quickView(my.data)  #to see the numeric encoding

4.2 Data input from the clipboard

There are, of course, many ways to enter data into R. Reading from a local file using read.file is perhaps the most preferred. However, many users will enter their data in a text editor or spreadsheet program and then want to copy and paste into R. This may be done by using read.table and specifying the input file as "clipboard" (PCs) or "pipe(pbpaste)" (Macs). Alternatively, the read.clipboard set of functions are perhaps more user friendly:

read.clipboard is the base function for reading data from the clipboard.

read.clipboard.csv for reading text that is comma delimited.

read.clipboard.tab for reading text that is tab delimited (e.g., copied directly from an Excel file).

read.clipboard.lower for reading input of a lower triangular matrix with or without a diagonal. The resulting object is a square matrix.

read.clipboard.upper for reading input of an upper triangular matrix.

read.clipboard.fwf for reading in fixed width fields (some very old data sets).

For example, given a data set copied to the clipboard from a spreadsheet, just enter the command

R code
my.data <- read.clipboard()

This will work if every data field has a value and even missing data are given some values (e.g., NA or -999). If the data were entered in a spreadsheet and the missing values


were just empty cells, then the data should be read in as tab delimited or by using the read.clipboard.tab function.

R code
my.data <- read.clipboard(sep="\t")  #define the tab option, or
my.tab.data <- read.clipboard.tab()  #just use the alternative function

For the case of data in fixed width fields (some old data sets tend to have this format), copy to the clipboard and then specify the width of each field (in the example below, the first variable is 5 columns, the second is 2 columns, the next 5 are 1 column, the last 4 are 3 columns).

R code
my.data <- read.clipboard.fwf(widths=c(5,2,rep(1,5),rep(3,4)))

4.3 Basic descriptive statistics

Once the data are read in, then describe will provide basic descriptive statistics arranged in a data frame format. Consider the data set sat.act, which includes data from 700 web based participants on 3 demographic variables and 3 ability measures.

describe reports means, standard deviations, medians, min, max, range, skew, kurtosis and standard errors for integer or real data. Non-numeric data, although the statistics are meaningless, will be treated as if numeric (based upon the categorical coding of the data) and will be flagged with an *.

It is very important to describe your data before you continue on doing more complicated multivariate statistics. The problem of outliers and bad data cannot be overemphasized.

> library(psych)
> data(sat.act)
> describe(sat.act)  #basic descriptive statistics

          vars   n   mean     sd median trimmed    mad min max range  skew kurtosis   se
gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61    -1.62 0.02
education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68    -0.07 0.05
age          3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64     2.42 0.36
ACT          4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66     0.53 0.18
SATV         5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64     0.33 4.27
SATQ         6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59    -0.02 4.41

4.4 Simple descriptive graphics

Graphic descriptions of data are very helpful both for understanding the data as well as for communicating important results. Scatter Plot Matrices (SPLOMS) using the pairs.panels function are useful ways to look for strange effects involving outliers and non-linearities. error.bars.by will show group means with 95% confidence boundaries.

4.4.1 Scatter Plot Matrices

Scatter Plot Matrices (SPLOMS) are very useful for describing the data. The pairs.panels function, adapted from the help menu for the pairs function, produces xy scatter plots of each pair of variables below the diagonal, shows the histogram of each variable on the diagonal, and shows the lowess locally fit regression line as well. An ellipse around the mean, with the axis length reflecting one standard deviation of the x and y variables, is also drawn. The x axis in each scatter plot represents the column variable, the y axis the row variable (Figure 1). When plotting many subjects, it is both faster and cleaner to set the plot character (pch) to be '.'. (See Figure 1 for an example.)

pairs.panels will show the pairwise scatter plots of all the variables as well as histograms, locally smoothed regressions, and the Pearson correlation. When plotting many data points (as in the case of the sat.act data), it is possible to specify that the plot character is a period to get a somewhat cleaner graphic.

4.4.2 Correlational structure

There are many ways to display correlations. Tabular displays are probably the most common. The output from the cor function in core R is a rectangular matrix. lowerMat will round this to (2) digits and then display it as a lower off diagonal matrix. lowerCor calls cor with use='pairwise', method='pearson' as default values and returns (invisibly) the full correlation matrix and displays the lower off diagonal matrix.

> lowerCor(sat.act)

          gendr edctn age   ACT   SATV  SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64  1.00

When comparing results from two different groups, it is convenient to display them as one matrix, with the results from one group below the diagonal and the other group above the diagonal. Use lowerUpper to do this:

> female <- subset(sat.act,sat.act$gender==2)
> male <- subset(sat.act,sat.act$gender==1)
> lower <- lowerCor(male[-1])


> png( 'pairspanels.png' )
> pairs.panels(sat.act,pch='.')
> dev.off()
null device
          1

Figure 1: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. The plot character was set to a period (pch='.') in order to make a cleaner graph.


          edctn age   ACT   SATV  SATQ
education  1.00
age        0.61  1.00
ACT        0.16  0.15  1.00
SATV       0.02 -0.06  0.61  1.00
SATQ       0.08  0.04  0.60  0.68  1.00

> upper <- lowerCor(female[-1])

          edctn age   ACT   SATV  SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08  1.00
SATV       0.07 -0.03  0.53  1.00
SATQ       0.03 -0.09  0.58  0.63  1.00

> both <- lowerUpper(lower,upper)
> round(both,2)

          education   age  ACT  SATV  SATQ
education        NA  0.52 0.16  0.07  0.03
age            0.61    NA 0.08 -0.03 -0.09
ACT            0.16  0.15   NA  0.53  0.58
SATV           0.02 -0.06 0.61    NA  0.63
SATQ           0.08  0.04 0.60  0.68    NA

It is also possible to compare two matrices by taking their differences and displaying one (below the diagonal) and the difference of the second from the first above the diagonal:

> diffs <- lowerUpper(lower,upper,diff=TRUE)
> round(diffs,2)

          education   age  ACT  SATV SATQ
education        NA  0.09 0.00 -0.05 0.05
age            0.61    NA 0.07 -0.03 0.13
ACT            0.16  0.15   NA  0.08 0.02
SATV           0.02 -0.06 0.61    NA 0.05
SATQ           0.08  0.04 0.60  0.68   NA

4.4.3 Heatmap displays of correlational structure

Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set, which has a clear 3 factor solution (Figure 2), or a simulated data set of 24 variables with a circumplex structure (Figure 3). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.


> png('corplot.png')
> cor.plot(Thurstone,numbers=TRUE,main="9 cognitive variables from Thurstone")
> dev.off()
null device
          1

Figure 2: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option, the values are displayed as well.


> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> cor.plot(r.circ,main='24 variables in a circumplex')
> dev.off()
null device
          1

Figure 3: Using the cor.plot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect.


4.5 Polychoric, tetrachoric, polyserial, and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, φ will underestimate the value of the Pearson applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process. A simple generalization of this to the case of multiple cuts is the polychoric correlation.
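
A small simulation illustrates the underestimation; this sketch assumes only base R plus psych, with the latent correlation set to .6:

R code
library(psych)
set.seed(42)
x <- rnorm(1000)
y <- .6 * x + rnorm(1000,0,sqrt(1 - .36))   #latent correlation of .6
X <- cbind(x = as.numeric(x > 0), y = as.numeric(y > 0))  #cut both at their means
lowerCor(X)         #the phi coefficient is noticeably below .6
tetrachoric(X)$rho  #approximately recovers the latent .6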

Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlations.

If the data are a mix of continuous, polytomous and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.

The correlation matrix resulting from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will also happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigen values of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
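
A minimal sketch of the smoothing step, assuming the burt data set that is included in psych:

R code
library(psych)
data(burt)
min(eigen(burt)$values)     #slightly negative: the matrix is not positive semi-definite
burt.s <- cor.smooth(burt)  #adjust and rescale the eigen values
min(eigen(burt.s)$values)   #now non-negative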

5 Item and scale analysis

The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.

5.1 Dimension reduction through factor analysis and cluster analysis

Parsimony of description has been a goal of science since at least the famous dictum, commonly attributed to William of Ockham, to not multiply entities beyond necessity.[1] The goal for parsimony is seen in psychometrics as an attempt either to describe (components) or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

[1] Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.

The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs.[2] At the most simple, a set of items can be thought to represent a random sample from one underlying domain, or perhaps a small set of domains. The question for the psychometrician is how many domains are represented, and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.

[2] Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.

At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores, or to estimate (for factors) factor scores.
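
A minimal sketch of the data level solution, assuming the three sat.act ability measures (complete cases only):

R code
library(psych)
X <- scale(na.omit(sat.act[4:6]))  #standardize the three ability measures
sv <- svd(X)
loadings <- sv$v %*% diag(sv$d)/sqrt(nrow(X) - 1)
round(loadings,2)  #matches the unrotated principal component loadings up to sign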

Several of the functions in psych address the problem of data reduction:

fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will eventually be phased out.

principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values.

factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the sum of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings (see the sketch following this list).

vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ2 estimate.

fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.

fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

nfactors A number of different tests for the number of factors problem are run.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.
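
For example, the congruence between a factor solution and a components solution of the Thurstone correlation matrix (analyzed in the following sections) might be examined with this sketch:

R code
library(psych)
f3 <- fa(Thurstone,3)         #three minres factors
p3 <- principal(Thurstone,3)  #three rotated principal components
factor.congruence(f3,p3)      #cosines between the two sets of loadings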

5.1.1 Minimum Residual Factor Analysis

The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix R (of n variables) be approximated by the product of an n x k factor matrix F and its transpose, plus a diagonal matrix of uniquenesses?

R = FF′ + U²     (1)

The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls" and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.

factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.

A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 1). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ" and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation, although prior versions defaulted to varimax. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).

5.1.2 Principal Axis Factor Analysis

An alternative least squares algorithm, factor.pa (incorporated into fa as an option, fm = "pa"), does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter parameter to 1 produces the SAS solution.

The solutions from the fa, the factor.minres and factor.pa as well as the principal functions can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations are varimax and quartimax. Oblique transformations include oblimin, quartimin and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.

Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 2).

5.1.3 Weighted Least Squares Factor Analysis

Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller overall residuals.

Table 1: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.
> f3t <- fa(Thurstone,3,n.obs=213)
> f3t
Factor Analysis using method =  minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  MR1   MR2   MR3   h2   u2 com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.0
First.Letters    0.00  0.86  0.00 0.73 0.27 1.0
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

 With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity =  1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90 % confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63


Table 2: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.
> f3 <- fa(Thurstone,3,n.obs = 213,fm="pa")
> f3o <- target.rot(f3)
> f3o
Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix
                  PA1   PA2   PA3   h2   u2
Sentences        0.89 -0.03  0.07 0.81 0.19
Vocabulary       0.89  0.07  0.00 0.80 0.20
Sent.Completion  0.83  0.04  0.03 0.70 0.30
First.Letters   -0.02  0.85 -0.01 0.73 0.27
4.Letter.Words  -0.05  0.74  0.09 0.57 0.43
Suffixes         0.17  0.63 -0.09 0.43 0.57
Letter.Series   -0.06 -0.08  0.84 0.69 0.31
Pedigrees        0.33 -0.09  0.48 0.37 0.63
Letter.Group    -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00


In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 3).

The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 4). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 4) or by a structure diagram (Figure 5).

> plot(f3t)


Figure 4: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.

A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but that the minres solution is slightly better.

Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.
> f3w <- fa(Thurstone,3,n.obs = 213,fm="wls")
> print(f3w,cut=0,digits=3)
Factor Analysis using method =  wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix
                   WLS1   WLS2   WLS3    h2    u2  com
Sentences         0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary        0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion   0.833  0.034  0.007 0.735 0.265 1.00
First.Letters    -0.002  0.855  0.003 0.731 0.269 1.00
4.Letter.Words   -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes          0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series     0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees         0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group     -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

 With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity =  1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90 % confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627


> fa.diagram(f3t)


Figure 5: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.


At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.

5.1.4 Principal Components analysis (PCA)

An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Psychologists typically use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 4 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. (Factor loadings and PCA loadings converge as the number of items increases; factor loadings may be thought of as the asymptotic principal component loadings as the number of items on the component grows towards infinity.) Also note how the goodness of fit statistics, based upon the residual off diagonal elements, are much worse than the fa solution.

5.1.5 Hierarchical and bi-factor solutions

For a long time, structural analyses of the ability domain have considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).

Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 5) can themselves be factored. This results in a higher order factor model (Figure 6). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 7). Yet another solution is to use structural equation modeling to directly solve for the general and group factors.


Table 4: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 2. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.
> p3p <- principal(Thurstone,3,n.obs = 213,rotate="Promax")
> p3p
Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  RC1   RC2   RC3   h2   u2 com
Sentences        0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary       0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion  0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters    0.01  0.84  0.07 0.78 0.22 1.0
4.Letter.Words  -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes         0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series    0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees        0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group    -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

 With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity =  1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
 with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98


> om.h <- omega(Thurstone,n.obs=213,sl=FALSE)
> op <- par(mfrow=c(1,1))


Figure 6: A higher order factor solution to the Thurstone 9 variable problem.

Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011). Unfortunately, this rotation will frequently achieve a local rather than global optimum, and it is recommended to try multiple original configurations.
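
A minimal sketch of this approach, using the bifactorT function from GPArotation on an unrotated four factor solution (one general plus three group factors is an assumption here, and rerunning from different starting configurations is advisable given the local optimum problem):

R code
library(psych)
library(GPArotation)
f4 <- fa(Thurstone,4,n.obs=213,rotate="none")  #unrotated four factor solution
bifactorT(f4$loadings)  #orthogonal bifactor rotation (Jennrich and Bentler, 2011)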


> om <- omega(Thurstone,n.obs=213)


Figure 7: A bifactor solution to the Thurstone 9 variable problem.


5.1.6 Item Cluster Analysis: iclust

An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930's, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) review why the iclust algorithm is particularly appropriate for scale construction in marketing.

Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known, somewhat pretentiously, as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust, and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):

1. Find the proximity (e.g., correlation) matrix.

2. Identify the most similar pair of items.

3. Combine this most similar pair of items to form a new variable (cluster).

4. Find the similarity of this cluster to all other items and clusters.

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains, or in iclust if there is a failure to increase reliability coefficients α or β).

6. Purify the solution by reassigning items to the most similar cluster center.

iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 8).
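For example, to specify the number of clusters a priori rather than relying on the α and β stopping rule, a call along these lines might be used (a sketch using the bfi items analyzed below):

> ic2 <- iclust(bfi[1:25], nclusters = 2)   #force a two cluster solution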


Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 7). The pattern matrix is available as an object in the results.

> data(bfi)
> ic <- iclust(bfi[1:25])

Figure 8: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.

The previous analysis (Figure 8) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 9). Note that finding the polychoric correlations takes some time the first time, but the next three analyses were done using that correlation matrix (r.poly$rho). When using the console for input, polychoric will report on its progress while working, using progressBar.


Table 5: The summary statistics from an iclust analysis show three large clusters and one smaller cluster.

> summary(ic)  #show the results
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6:
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:

      C20    C16    C15   C21
C20  0.80 -0.291  0.40 -0.33
C16 -0.24  0.815 -0.29  0.11
C15  0.30 -0.221  0.73 -0.30
C21 -0.23  0.074 -0.20  0.61


polychoric and tetrachoric, when running on OSX, will take advantage of the multiple cores on the machine and run somewhat faster.

Table 6: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.

> data(bfi)
> r.poly <- polychoric(bfi[1:25])  #the ... indicate the progress of the function

A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.

5.2 Confidence intervals using bootstrapping techniques

Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ2 and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (See Table 8). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.
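For polychoric based factoring, the fa.poly function will bootstrap in the same way (a sketch; n.iter is kept small here purely for speed):

> fp <- fa.poly(bfi[1:10], 2, n.iter = 20)  #bootstrapped confidence intervals from polychoric correlations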

5.3 Comparing factor/component/cluster solutions

Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of the congruence coefficient, which is just the cosine of the angle between the dimensions:

$$c_{f_i f_j} = \frac{\sum_{k=1}^{n} f_{ik} f_{jk}}{\sqrt{\sum f_{ik}^2 \sum f_{jk}^2}}$$
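The factor.congruence function computes this for whole sets of loading matrices; for two single loading vectors the coefficient reduces to a one line function (a minimal sketch, assuming f1 and f2 are numeric vectors of loadings on the same items):

> congruence <- function(f1, f2) sum(f1 * f2) / sqrt(sum(f1^2) * sum(f2^2))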

Consider the case of a four factor solution and four cluster solution to the Big Five problem.

> f4 <- fa(bfi[1:25],4,fm="pa")
> factor.congruence(f4,ic)

      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12


> ic.poly <- iclust(r.poly$rho,title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)

Figure 9: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 8), which was done using Pearson correlations.


> ic.poly <- iclust(r.poly$rho,5,title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)

Figure 10: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 9), which was done without specifying the number of clusters, and to the next one (Figure 11), which was done by changing the beta criterion.


> ic.poly <- iclust(r.poly$rho,beta.size=3,title="ICLUST beta.size=3")

Figure 11: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 8, 9, 10).


Table 7: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.

> print(ic,cut=.3)
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
    O   P    C20   C16   C15   C21
A1 C20 C20
A2 C20 C20  0.59
A3 C20 C20  0.65
A4 C20 C20  0.43
A5 C20 C20  0.65
C1 C15 C15              0.54
C2 C15 C15              0.62
C3 C15 C15              0.54
C4 C15 C15        0.31 -0.66
C5 C15 C15 -0.30  0.36 -0.59
E1 C20 C20 -0.50
E2 C20 C20 -0.61  0.34
E3 C20 C20  0.59             -0.39
E4 C20 C20  0.66
E5 C20 C20  0.50        0.40 -0.32
N1 C16 C16        0.76
N2 C16 C16        0.75
N3 C16 C16        0.74
N4 C16 C16 -0.34  0.62
N5 C16 C16        0.55
O1 C20 C21                   -0.53
O2 C21 C21                    0.44
O3 C20 C21  0.39             -0.62
O4 C21 C21                   -0.33
O5 C21 C21                    0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit =  0.68   Pattern fit =  0.96  RMSR =  0.05
NULL


Table 8: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.

> fa(bfi[1:10],2,n.iter=20)
Factor Analysis with confidence intervals using method =  fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method =  minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix

     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with MLE Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.078 and the 90% confidence intervals are 0.071 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy

                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.46 -0.40 -0.36
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.06 -0.03  0.00  0.74  0.77  0.80
A4  0.10  0.15  0.18  0.41  0.44  0.48
A5 -0.02  0.02  0.07  0.58  0.62  0.66
C1  0.52  0.57  0.60 -0.08 -0.05 -0.02
C2  0.59  0.62  0.66 -0.04 -0.01  0.02
C3  0.50  0.54  0.57 -0.01  0.03  0.08
C4 -0.70 -0.66 -0.62 -0.02  0.01  0.03
C5 -0.62 -0.57 -0.53 -0.08 -0.05 -0.01

Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.28     0.32  0.37


PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90

A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor and component solutions to the Thurstone data set is done using the factor.congruence function (See Table 9).

Table 9: Congruence coefficients for oblique factor, bifactor and component solutions for the Thurstone problem.

> factor.congruence(list(f3t,f3o,om,p3p))

     MR1  MR2  MR3   PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2   RC3
MR1 1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08  0.04
MR2 0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99  0.12
MR3 0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02  0.99
PA1 1.00 0.03 0.01  1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2 0.06 1.00 0.01  0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99  0.05
PA3 0.13 0.06 1.00  0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01  0.99
g   0.72 0.60 0.52  0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58  0.50
F1  1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08  0.04
F2  0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99  0.12
F3  0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02  0.99
h2  0.74 0.57 0.51  0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56  0.49
RC1 1.00 0.04 0.06  1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06  0.00
RC2 0.08 0.99 0.02  0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00  0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05  1.00

5.4 Determining the number of dimensions to extract

How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).

Techniques most commonly used include

1) Extracting factors until the chi square of the residual matrix is not significant.

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis) fa.parallel (Horn, 1965).

4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).

5) Extracting factors as long as they are interpretable.

6) Using the Very Simple Structure Criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.

Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples (N > 1,000) all of the eigen values of random factors will be very close to 1. The scree test is quite appealing but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigators creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.

An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest but that account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
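A sketch of this kind of simulation (assuming sim.minor's defaults for the size of the minor factor loadings) is to generate data with a known number of major factors plus minor factors and then see whether the usual criteria over-extract:

> set.seed(42)
> minor <- sim.minor(nvar = 12, nfact = 3, n = 500)  #3 major factors plus minor factors
> fa.parallel(minor$observed)                        #how many factors do the criteria suggest?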

5.4.1 Very Simple Structure

The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Absolute Partial correlation) of Velicer (1976).

Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 12). However, the MAP criterion suggests that 5 is optimal.


> vss <- vss(bfi[1:25],title="Very Simple Structure of a Big 5 inventory")

Figure 12: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.


> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximum of 0.58  with  4  factors
VSS complexity 2 achieves a maximum of 0.74  with  4  factors

The Velicer MAP achieves a minimum of 0.01  with  5  factors
BIC achieves a minimum of  -524.26  with  8  factors
Sample Size adjusted BIC achieves a minimum of  -117.56  with  8  factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit RMSEA  BIC   SABIC complex eChisq  SRMR eCRMS  eBIC
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.123 9648 10522.1     1.0  23881 0.119 0.125 21698
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.100 5287  6084.5     1.2  12432 0.086 0.094 10440
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.087 3200  3924.3     1.3   7232 0.066 0.075  5422
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.074 1731  2385.1     1.4   3750 0.047 0.057  2115
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.055  281   869.3     1.6   1495 0.030 0.038    27
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.043 -296   228.5     1.7    670 0.020 0.027  -639
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.037 -463     1.2     1.9    448 0.016 0.023  -711
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.032 -524  -117.6     1.9    289 0.013 0.020  -727

5.4.2 Parallel Analysis

An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 13). It is interesting to compare fa.parallel with paran from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial minor (but not zero) factors, show that using smcs will tend to lead to too many factors.

A more tedious problem, in terms of computation, is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly or fa.parallel with the cor option="poly". By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
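A call along these lines will do it (a sketch; expect the polychoric step to make this much slower than the Pearson case):

> fa.parallel(bfi[1:25], cor = "poly", n.iter = 20)   #parallel analysis of polychoric correlations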

5.5 Factor extension

Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937).


> fa.parallel(bfi[1:25],main="Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors =  6  and the number of components =  6

Figure 13: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.


Consider the case of 16 variables created to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (See Figure 14).

Another way to examine the overlap between two sets is the use of set correlation found by setcor (discussed later).

6 Classical Test Theory and Reliability

Surprisingly, 112 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).

α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

$$\lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(\vec{V}_x)}{V_x}\right) = \frac{n}{n-1}\,\frac{V_x - tr(\vec{V}_x)}{V_x} = \alpha$$

Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is, without doubt, the most frequently reported measure of internal consistency reliability. For τ equivalent items, alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy") coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.

Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or, more precisely, the variance of the errors, e_j^2, and is

$$\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum (1 - r_{smc}^2)}{V_x}$$

The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.
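The smc function reports these values directly; for example (a sketch using the bfi items from above):

> R <- cor(bfi[1:25], use = "pairwise")
> round(smc(R), 2)   #squared multiple correlation of each item with all the others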

G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale.


> v16 <- sim.item(16)
> s <- c(1,3,5,7,9,11,13,15)
> f2 <- fa(v16[,s],2)
> fe <- fa.extension(cor(v16)[s,-s],f2)
> fa.diagram(f2,fe=fe)

Figure 14: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.


A modification of G6, G6*, takes as an estimate of an item reliability the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.

Alpha, G6 and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.

More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.

Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.

Functions for examining the reliability of a single scale or a set of scales include:

alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item-whole correlations, α if an item is omitted, and item means and standard deviations.

guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zergers (1978) (μ0 . . . μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).

omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)

cluster.cor Given a n x c cluster definition matrix of -1s, 0s, and 1s (the keys), and a n x n correlation matrix, find the correlations of the composite clusters.

scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of item scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha coefficient, G6, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.

score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.

6.1 Reliability of a single scale

A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, McDonald's ωh and ωt. Consider a simulated data set representing 9 items with a hierarchical structure and the following correlation matrix. Then, using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as the estimates if one item is dropped at a time.

> set.seed(17)
> r9 <- sim.hierarchical(n=500,raw=TRUE)$observed
> round(cor(r9),2)

     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00

> alpha(r9)

Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
  0.77   0.8  0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n rawr std.r r.cor r.drop    mean   sd
V1 500 0.77  0.76  0.76   0.67  0.0327 1.02
V2 500 0.67  0.66  0.62   0.55  0.1071 1.00
V3 500 0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500 0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500 0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500 0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500 0.60  0.60  0.52   0.46  0.0481 1.01
V8 500 0.58  0.57  0.49   0.43  0.0526 1.07
V9 500 0.47  0.48  0.36   0.32  0.0164 0.97

Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that item 2 is to be scored (incorrectly) negatively.

> keys <- c(1,-1,1,1,1,1,1)
> alpha(attitude,keys)

Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
       0.43      0.52    0.75      0.14 1.1 0.13   58 5.5

 lower alpha upper     95% confidence boundaries
  0.18  0.43  0.68

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r  S/N alpha se
rating           0.32      0.44    0.67     0.114 0.77    0.170
complaints-      0.80      0.80    0.82     0.394 3.89    0.057
privileges       0.27      0.41    0.72     0.103 0.69    0.176
learning         0.14      0.31    0.64     0.069 0.44    0.207
raises           0.14      0.27    0.61     0.059 0.38    0.205
critical         0.36      0.47    0.76     0.130 0.90    0.139
advance          0.21      0.34    0.66     0.079 0.51    0.171

 Item statistics
             n  rawr std.r r.cor r.drop mean   sd
rating      30  0.60  0.60  0.60   0.33   65 12.2
complaints- 30 -0.57 -0.58 -0.74  -0.74   50 13.3
privileges  30  0.66  0.65  0.54   0.42   53 12.2
learning    30  0.80  0.79  0.78   0.64   56 11.7
raises      30  0.82  0.83  0.85   0.69   65 10.4
critical    30  0.50  0.53  0.35   0.27   75  9.9
advance     30  0.74  0.75  0.71   0.58   43 10.3

Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if item 2 is dropped, then the reliability is improved substantially. This suggests that item 2 was incorrectly scored. Doing the analysis again with item 2 positively scored produces much more favorable results.

> keys <- c(1,1,1,1,1,1,1)
> alpha(attitude,keys)


Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
  0.76  0.84  0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n rawr std.r r.cor r.drop mean   sd
rating     30 0.78  0.76  0.75   0.67   65 12.2
complaints 30 0.84  0.81  0.82   0.74   67 13.3
privileges 30 0.70  0.68  0.60   0.56   53 12.2
learning   30 0.81  0.80  0.78   0.71   56 11.7
raises     30 0.85  0.86  0.85   0.79   65 10.4
critical   30 0.42  0.45  0.31   0.27   75  9.9
advance    30 0.60  0.62  0.56   0.46   43 10.3

It is useful, when considering items for a potential scale, to examine the item distribution. This is done in scoreItems as well as in alpha.

> items <- sim.congeneric(N=500,short=FALSE,low=-2,high=2,categorical=TRUE)  #500 responses to 4 discrete items
> alpha(items$observed)  #item response analysis of congeneric measures

Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
  0.68  0.72  0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n rawr std.r r.cor r.drop  mean   sd
V1 500 0.81  0.81  0.74   0.62 0.058 0.97
V2 500 0.74  0.75  0.63   0.52 0.012 0.98
V3 500 0.72  0.72  0.57   0.48 0.056 0.99
V4 500 0.68  0.67  0.48   0.41 0.098 1.02

Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0

6.2 Using omega to find the reliability of a single scale

Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (See Figure 15) or omegaSem for a confirmatory analysis using the sem package (Fox et al., 2013) based upon the exploratory solution from omega.

McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009) http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf.)

One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.
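omega does all of this in one call, but the logic can be sketched by hand (using the r9 data from above; the Schmid-Leiman step is done by schmid, and ωh is the squared sum of the general factor loadings relative to the total test variance):

> R <- cor(r9)
> sl <- schmid(R, 3)   #oblique factoring followed by a Schmid-Leiman transformation
> g <- sl$sl[, 1]      #the general factor loadings
> sum(g)^2 / sum(R)    #omega_h for standardized items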

ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal). For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively, and some negatively. Although probably better to specify which directions the items are to be scored by specifying a key vector, if flip=TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)

β, an alternative to ω, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).

The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).

Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.

There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be √rab, where rab is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. 2007. To do this in omega, add the option="first" or option="second" to the call.

Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.

In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.

McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness, $u_j^2$, from factor analysis to find $e_j^2$. This is based on a decomposition of the variance of a test score, Vx, into four parts: that due to a general factor, g, that due to a set of group factors, f (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)

Letting $x = cg + Af + Ds + e$, then the communality of item j, based upon general as well as group factors, is $h_j^2 = c_j^2 + \sum f_{ij}^2$, and the unique variance for the item, $u_j^2 = \sigma_j^2 (1 - h_j^2)$, may be used to estimate the test reliability. That is, if $h_j^2$ is the communality of item j, based upon general as well as group factors, then for standardized items, $e_j^2 = 1 - h_j^2$ and

$$\omega_t = \frac{\mathbf{1}cc'\mathbf{1} + \mathbf{1}AA'\mathbf{1}'}{V_x} = 1 - \frac{\sum(1-h_j^2)}{V_x} = 1 - \frac{\sum u^2}{V_x}$$


Because $h_j^2 \geq r_{smc}^2$, $\omega_t \geq \lambda_6$.

It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor:

$$\omega_h = \frac{\mathbf{1}cc'\mathbf{1}}{V_x}$$

Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

$$\omega_{inf} = \frac{\mathbf{1}cc'\mathbf{1}}{\mathbf{1}cc'\mathbf{1} + \mathbf{1}AA'\mathbf{1}'}$$

In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.

> om9

9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g   F1*   F2*   F3*   h2   u2   p2
V1 0.72  0.45             0.72 0.28 0.71
V2 0.58  0.37             0.47 0.53 0.71
V3 0.57  0.41             0.49 0.51 0.66
V4 0.57        0.41       0.50 0.50 0.66
V5 0.55        0.32       0.41 0.59 0.73
V6 0.43        0.27       0.28 0.72 0.66
V7 0.47              0.47 0.44 0.56 0.50
V8 0.43              0.42 0.36 0.64 0.51
V9 0.32              0.23 0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03


> om9 <- omega(r9,title="9 simulated variables")

Figure 15: A bifactor solution for 9 simulated variables with a hierarchical structure.


RMSEA index =  0.026  and the 90% confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.1  and the 90% confidence intervals are  0.085 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

6.3 Estimating ωh using Confirmatory Factor Analysis

The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.

> omegaSem(r9,n.obs=500)

Call: omegaSem(m = r9, n.obs = 500)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g   F1*   F2*   F3*   h2   u2   p2
V1 0.72  0.45             0.72 0.28 0.71
V2 0.58  0.37             0.47 0.53 0.71
V3 0.57  0.41             0.49 0.51 0.66
V4 0.57        0.41       0.50 0.50 0.66
V5 0.55        0.32       0.41 0.59 0.73
V6 0.43        0.27       0.28 0.72 0.66
V7 0.47              0.47 0.44 0.56 0.50
V8 0.43              0.42 0.36 0.64 0.51
V9 0.32              0.23 0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03
RMSEA index =  0.026  and the 90% confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.1  and the 90% confidence intervals are  0.085 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

 Omega Hierarchical from a confirmatory model using sem =  0.72
 Omega Total from a confirmatory model using sem =  0.83
With loadings of
      g   F1*   F2*   F3*   h2   u2
V1 0.74  0.40             0.71 0.29
V2 0.58  0.36             0.47 0.53
V3 0.56  0.42             0.50 0.50
V4 0.57        0.45       0.53 0.47
V5 0.58        0.25       0.40 0.60
V6 0.43        0.26       0.26 0.74
V7 0.49              0.38 0.38 0.62
V8 0.44              0.45 0.39 0.61
V9 0.34              0.24 0.17 0.83

With eigenvalues of:
   g  F1*  F2*  F3*
2.60 0.47 0.40 0.33


6.3.1 Other estimates of reliability

Other estimates of reliability are found by the splitHalf function. These are described in more detail in Revelle and Zinbarg (2009). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.

> splitHalf(r9)

Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) =  0.84
Guttman lambda 6                          =  0.8
Average split half reliability            =  0.79
Guttman lambda 3 (alpha)                  =  0.8
Minimum split half reliability (beta)     =  0.72

6.4 Reliability and correlations of multiple scales within an inventory

A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness and Openness). The data may either be the raw data or a correlation matrix (scoreItems) or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6*.

6.4.1 Scoring from raw data

To score these five scales from the 25 items, use the scoreItems function with the helper function make.keys. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.

There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.

The printed output includes coefficients α and G6, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.

First, create the keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copied into R.) Although not normally necessary, show the keys to understand what is happening.

Note that the number of items to specify in the make.keys function is the total number of items in the inventory. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.

> keys <- make.keys(nvars=28,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+    Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+    Openness = c(21,-22,23,24,-25)),
+    item.labels=colnames(bfi))
> keys

          Agree Conscientious Extraversion Neuroticism Openness
A1           -1             0            0           0        0
A2            1             0            0           0        0
A3            1             0            0           0        0
A4            1             0            0           0        0
A5            1             0            0           0        0
C1            0             1            0           0        0
C2            0             1            0           0        0
C3            0             1            0           0        0
C4            0            -1            0           0        0
C5            0            -1            0           0        0
E1            0             0           -1           0        0
E2            0             0           -1           0        0
E3            0             0            1           0        0
E4            0             0            1           0        0
E5            0             0            1           0        0
N1            0             0            0           1        0
N2            0             0            0           1        0
N3            0             0            0           1        0
N4            0             0            0           1        0
N5            0             0            0           1        0
O1            0             0            0           0        1
O2            0             0            0           0       -1
O3            0             0            0           0        1
O4            0             0            0           0        1
O5            0             0            0           0       -1
gender        0             0            0           0        0
education     0             0            0           0        0
age           0             0            0           0        0

The use of multiple key matrices for different inventories is facilitated by using the superMatrix function to combine two or more matrices. This allows convenient scoring of large data sets combining multiple inventories with keys based upon each individual inventory. Pretend for the moment that the big 5 items were made up of two inventories, one consisting of the first 10 items, the second the last 18 items (15 personality items + 3 demographic items). Then the following code would work:

> keys.1 <- make.keys(10,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10)))
> keys.2 <- make.keys(15,list(Extraversion=c(-1,-2,3:5),Neuroticism=c(6:10),
+    Openness = c(11,-12,13,14,-15)))
> keys.25 <- superMatrix(list(keys.1,keys.2))

The resulting keys matrix is identical to that found above, except that it does not include the extra 3 demographic items. This is useful when scoring the raw items, because the response frequencies for each category are reported, as they would be for the demographic data.

This use of making multiple key matrices and then combining them into one super matrix of keys is particularly useful when combining demographic information with items to be scored. A set of demographic keys can be made and then these can be combined with the keys for the particular scales.

Now use these keys in combination with the raw data to score the items, calculate basic reliability and intercorrelations, and find the item-by-scale correlations for each item and each scale. By default, missing data are replaced by the median for that variable.

> scores <- scoreItems(keys,bfi)
> scores

Call: scoreItems(keys = keys, items = bfi)

(Unstandardized) Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
alpha   0.7          0.72         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
    Agree Conscientious Extraversion Neuroticism Openness
ASE 0.014         0.014        0.013       0.011    0.017

Average item correlation:
          Agree Conscientious Extraversion Neuroticism Openness
average.r  0.32          0.34         0.39        0.46     0.23

Guttman 6* reliability:
         Agree Conscientious Extraversion Neuroticism Openness
Lambda.6   0.7          0.72         0.76        0.81      0.6

Signal/Noise based upon av.r :
             Agree Conscientious Extraversion Neuroticism Openness
Signal/Noise   2.3           2.6          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, alpha on the diagonal
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.70          0.36         0.63      -0.245     0.23
Conscientious  0.26          0.72         0.35      -0.305     0.30
Extraversion   0.46          0.26         0.76      -0.284     0.32
Neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
Openness       0.15          0.19         0.22      -0.086     0.60

In order to see the item by scale loadings and frequency counts of the data
print with the short option = FALSE

To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores (See Figure 16).

6.4.2 Forming scales from a correlation matrix

There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.

Consider the same bfi data set, but first find the correlations, and then use cluster.cor.

> r.bfi <- cor(bfi,use="pairwise")
> scales <- cluster.cor(keys,r.bfi)
> summary(scales)

Call: cluster.cor(keys = keys, r.mat = r.bfi)

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, (standardized) alpha on the diagonal
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.71          0.35         0.64      -0.242     0.25
Conscientious  0.25          0.73         0.36      -0.287     0.30
Extraversion   0.47          0.27         0.76      -0.278     0.35
Neuroticism   -0.18         -0.22        -0.22       0.815    -0.11
Openness       0.16          0.20         0.24      -0.074     0.61

To find the correlations of the items with each of the scales (the "structure" matrix), or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.


> png('scores.png')
> pairs.panels(scores$scores,pch='.',jiggle=TRUE)
> dev.off()
quartz
     2

Figure 16: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.


6.5 Scoring Multiple Choice Items

Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice, where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).

> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys,iqitems)

Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

 item statistics
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39

> #just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys,iqitems,score=FALSE)
> describe(iq.tf)  #compare to previous results

          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01

Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.

6.6 Item analysis

Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or based upon correlation matrices using the setcor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.

Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 17).

7 Item Response Theory analysis

The use of Item Response Theory is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 18).


> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys,iqitems,score=TRUE,short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  #set this to see the output for multiple items
> irt.responses(scores$scores,iqitems[1:4],breaks=11)

[Figure 17 here: four panels (reason.4, reason.16, reason.17, reason.19), each plotting P(response) against theta for response alternatives 0 through 6.]

Figure 17: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.


7.1 Factor analysis and Item Response Theory

If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δj, and item discrimination, αj, may be found from factor analysis of the tetrachoric correlations, where λj is just the factor loading on the first factor and τj is the normal threshold reported by the tetrachoric function:

δj = D τj / √(1 − λj^2),    αj = λj / √(1 − λj^2)    (2)

where D is a scaling factor used when converting to the parameterization of the logistic model and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λj) and item thresholds (τj) are just

λj = αj / √(1 + αj^2),    τj = δj / √(1 + αj^2).
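For example, a minimal sketch of these conversions in R (the values of λ and τ are hypothetical; in practice they come from fa and tetrachoric output, and D = 1 for the normal metric):

lambda <- .6                            # hypothetical factor loading
tau <- .5                               # hypothetical threshold
(alpha <- lambda / sqrt(1 - lambda^2))  # discrimination: 0.75
(delta <- tau / sqrt(1 - lambda^2))     # difficulty: 0.625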

Consider 9 dichotomous items representing one factor but differing in their levels of difficulty.

> set.seed(17)
> d9 <- sim.irt(9, 1000, -2.5, 2.5, mod="normal")  # dichotomous items
> test <- irt.fa(d9$items)

> test

Item Response Analysis using Factor Analysis

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor = 1

              -3    -2    -1     0     1     2     3
V1          0.48  0.59  0.24  0.06  0.01  0.00  0.00
V2          0.30  0.68  0.45  0.12  0.02  0.00  0.00
V3          0.12  0.50  0.74  0.29  0.06  0.01  0.00
V4          0.05  0.26  0.74  0.55  0.14  0.03  0.00
V5          0.01  0.07  0.44  1.05  0.41  0.06  0.01
V6          0.00  0.03  0.15  0.60  0.74  0.24  0.04
V7          0.00  0.01  0.04  0.22  0.73  0.63  0.16
V8          0.00  0.00  0.02  0.12  0.45  0.69  0.31
V9          0.00  0.01  0.02  0.08  0.25  0.47  0.36
Test Info   0.98  2.14  2.85  3.09  2.81  2.14  0.89
SEM         1.01  0.68  0.59  0.57  0.60  0.68  1.06
Reliability -0.02 0.53  0.65  0.68  0.64  0.53 -0.12


Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient
The degrees of freedom for the model is 27 and the objective function was 1.2
The number of observations was 1000 with Chi Square = 1195.58 with prob < 7.3e-235

The root mean square of the residuals (RMSA) is 0.09
The df corrected root mean square of the residuals is 0.1

Tucker Lewis Index of factoring reliability = 0.699
RMSEA index = 0.209 and the 90% confidence intervals are 0.198 0.218
BIC = 1009.07

Similar analyses can be done for polytomous items, such as those of the bfi extraversion scale.

> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor = 1

              -3    -2    -1     0     1     2     3
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient
The degrees of freedom for the model is 5 and the objective function was 0.05
The number of observations was 2800 with Chi Square = 135.92 with prob < 1.3e-27

The root mean square of the residuals (RMSA) is 0.04
The df corrected root mean square of the residuals is 0.06

Tucker Lewis Index of factoring reliability = 0.932
RMSEA index = 0.097 and the 90% confidence intervals are 0.083 0.111
BIC = 96.23

The item information functions show that not all of the items are equally good (Figure 19):

These procedures can be generalized to more than one factor by specifying the number of


> op <- par(mfrow=c(3,1))
> plot(test, type="ICC")
> plot(test, type="IIC")
> plot(test, type="test")
> op <- par(mfrow=c(1,1))

[Figure 18 here: three panels plotting, against the latent trait (normal scale), the item characteristic curves ("Item parameters from factor analysis", probability of response), the item information curves ("Item information from factor analysis"), and the test information curve with its reliability axis ("Test information -- item parameters from factor analysis") for items V1 through V9.]

Figure 18: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.


> e.info <- plot(e.irt, type="IIC")

[Figure 19 here: "Item information from factor analysis for factor 1", the item information curves for E1 through E5 plotted against the latent trait (normal scale).]

Figure 19: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.


factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.

> print(e.info, sort=TRUE)

Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor = 1

              -3    -2    -1     0     1     2     3
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02

More extensive IRT packages include ltm and eRm and should be used for serious Item Response Theory analysis.

7.2 Speeding up analyses

Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.

For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 20) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items (Figure 21).

> iq.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = iq.tf)
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor = 1

              -3    -2    -1     0     1     2     3
reason.4    0.04  0.20  0.59  0.59  0.19  0.04  0.01
reason.16   0.07  0.23  0.46  0.39  0.16  0.05  0.01
reason.17   0.06  0.27  0.69  0.50  0.14  0.03  0.01
reason.19   0.05  0.17  0.41  0.46  0.22  0.07  0.02


> iq.irt <- irt.fa(iq.tf)

[Figure 20 here: "Item information from factor analysis", the item information curves for the 16 ability items (reason.4 through rotate.8) plotted against the latent trait (normal scale).]

Figure 20: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.


letter.7    0.04  0.16  0.44  0.52  0.23  0.06  0.02
letter.33   0.04  0.14  0.34  0.43  0.24  0.08  0.02
letter.34   0.04  0.17  0.48  0.54  0.22  0.06  0.01
letter.58   0.02  0.08  0.28  0.55  0.39  0.13  0.03
matrix.45   0.05  0.11  0.23  0.29  0.21  0.10  0.04
matrix.46   0.05  0.12  0.25  0.30  0.21  0.10  0.04
matrix.47   0.05  0.16  0.37  0.41  0.21  0.07  0.02
matrix.55   0.04  0.08  0.14  0.20  0.19  0.13  0.06
rotate.3    0.00  0.01  0.07  0.29  0.66  0.45  0.13
rotate.4    0.00  0.01  0.05  0.30  0.91  0.55  0.11
rotate.6    0.01  0.03  0.14  0.49  0.68  0.27  0.06
rotate.8    0.00  0.02  0.08  0.27  0.54  0.39  0.13
Test Info   0.55  1.95  4.99  6.53  5.41  2.57  0.71
SEM         1.34  0.72  0.45  0.39  0.43  0.62  1.18
Reliability -0.80 0.49  0.80  0.85  0.82  0.61 -0.40

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient
The degrees of freedom for the model is 104 and the objective function was 1.85
The number of observations was 1525 with Chi Square = 2801.05 with prob < 0

The root mean square of the residuals (RMSA) is 0.08
The df corrected root mean square of the residuals is 0.09

Tucker Lewis Index of factoring reliability = 0.742
RMSEA index = 0.131 and the 90% confidence intervals are 0.126 0.135
BIC = 2038.76

7.3 IRT based scoring

The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and those based upon IRT are practically identical (this may be seen in the examples for score.irt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items per person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The score.irt function applies basic IRT to this problem.

Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects.

> v9 <- sim.irt(9, 1000, -2, 2, mod="normal")  # dichotomous items
> items <- v9$items
> test <- irt.fa(items)


> om <- omega(iq.irt$rho, 4)

[Figure 21 here: the omega bifactor diagram for the 16 ability items, showing a general factor g (loadings of roughly .4 to .7 for all items) and four group factors, F1 (the rotate items, loadings .7, .6, .6, .5), F2, F3, and F4.]

Figure 21: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when examining the item response functions.


> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> # now delete some of the data - note that they are ordered by score
> items[1:333, 5:9] <- NA
> items[334:666, 3:7] <- NA
> items[667:1000, 1:4] <- NA
> scores <- score.irt(test, items)
> unitweighted <- score.irt(items=items, keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord], scores, unitweighted)
> colnames(scores.df) <- c("True theta", "irt theta", "total", "fit", "rasch", "total", "fit")

These results are seen in Figure 22

8 Multilevel modeling

Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (over all descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.

8.1 Decomposing data into within and between level correlations using statsBy

There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function that gives some of the basic descriptive statistics for two level models.

This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:

rxy = ηxwg ∗ ηywg ∗ rxywg + ηxbg ∗ ηybg ∗ rxybg    (3)


Figure 22: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.

> pairs.panels(scores.df, pch=".", gap=0)
> title("Comparing true theta for IRT, Rasch and classically based scoring", line=3)

[Figure 22 here: a pairs.panels scatter plot matrix, "Comparing true theta for IRT, Rasch and classically based scoring", relating True theta, irt theta, total, fit, rasch, total, and fit; True theta correlates .78 with the IRT based theta estimates and about .40 with the total scores.]


where rxy is the normal correlation, which may be decomposed into a within group and a between group correlation, rxywg and rxybg, and η (eta) is the correlation of the data with the within group values or the group means.

8.2 Generating and displaying multilevel data

withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5, and V8, and V3, V6, and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6, and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5, and V6, and V7, V8, and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.

sim.multilevel will generate simulated data with a multilevel structure.

The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.

Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)) or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
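A minimal sketch of the first case (the rwg and rbg elements of the statsBy output hold the pooled within group and the between group correlations):

sb <- statsBy(sat.act, group="education")
round(sb$rwg, 2)   # pooled within group correlations
round(sb$rbg, 2)   # between group correlations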

9 Set Correlation and Multiple Regression from the correlation matrix

An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is

R2 = 1 − ∏ (1 − λi),  with the product taken over i = 1, ..., n,


where λi is the ith eigen value of the eigen value decomposition of the matrix

R = Rxx^-1 Rxy Ryy^-1 Ryx.
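A minimal sketch of this computation from first principles, treating the first three sat.act variables as one set and the last three as the other; the result should match the Cohen's Set Correlation R2 reported by setCor below:

R <- cor(sat.act, use="pairwise")
Rxx <- R[1:3, 1:3]; Ryy <- R[4:6, 4:6]; Rxy <- R[1:3, 4:6]
lambda <- eigen(solve(Rxx) %*% Rxy %*% solve(Ryy) %*% t(Rxy))$values
1 - prod(1 - lambda)   # Cohen's set correlation R2 (about .09 for these data)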

Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic based upon the average canonical correlation might be more appropriate.

setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.

Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix and thus, if using the correlation matrix, will report standardized β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example, the analysis is done on the correlation matrix rather than the raw data.

> C <- cov(sat.act, use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)

Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared: 0.0272, Adjusted R-squared: 0.02301
F-statistic: 6.487 on 3 and 696 DF, p-value: 0.0002476

Compare this with the output from setCor.

> # compare with mat.regress
> setCor(c(4:6), c(1:3), C, n.obs=700)


Call: setCor(y = y, x = x, data = data, z = z, n.obs = n.obs, use = use,
    std = std, square = square, main = main, plot = plot)

Multiple Regression from matrix input

Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19
multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11
Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05


degrees of freedom of regression
[1]   3 696

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.050 0.033 0.008
Chisq of canonical correlations
[1] 35.8 23.1  5.6

Average squared canonical correlation = 0.03
Cohens Set Correlation R2 = 0.09
Shrunken Set Correlation R2 = 0.08
F and df of Cohens Set Correlation 7.26 9 1681.86
Unweighted correlation between the two sets = 0.01

Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same independent of the direction of the relationship.

10 Simulation functions

It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.

A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex and sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated Rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.

sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.simple Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.

sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure. (See the sketch following this list.)

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.

Some of these functions are described in more detail in the companion vignette, psych for sem.
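For instance, a minimal sketch of sim.congeneric from the list above, which (with no sample size specified) returns the implied population correlation matrix:

r4 <- sim.congeneric(loads=c(.8, .7, .6, .5))  # a one factor model for four tests
round(r4, 2)   # each correlation is the product of the two factor loadings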

The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.

Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.

Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.

Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.


The four IRT simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.

The logistic model is

P(x|θi, δj, γj, ζj) = γj + (ζj − γj) / (1 + e^(αj(δj − θi)))    (4)

where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), αj is item discrimination and δj is item difficulty. For the 1 Parameter Logistic (Rasch) model, γ = 0, ζ = 1, α = 1, and item difficulty is the only free parameter to specify.
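As a numerical check on equation 4, a minimal sketch of the four parameter logistic function (all parameter values here are hypothetical):

irf.4pl <- function(theta, alpha=1, delta=0, gamma=0, zeta=1)
  gamma + (zeta - gamma)/(1 + exp(alpha * (delta - theta)))
irf.4pl(theta=0)             # 0.5: when theta = delta, p = .5
irf.4pl(theta=0, gamma=.2)   # 0.6: guessing raises the lower asymptote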

(Graphics of these may be seen in the demonstrations for the logistic function)

The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same. With the a = α parameter = 1.702 in the logistic model, the two models are practically identical.

11 Graphical Displays

Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as specific diagram functions, e.g., (but not shown):

f3 <- fa(Thurstone, 3)
plot(f3)
fa.diagram(f3)
c <- iclust(Thurstone)
plot(c)              # a pretty boring plot
iclust.diagram(c)    # a better diagram
c3 <- iclust(Thurstone, 3)
plot(c3)             # a more interesting plot
data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)
ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)

The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in the dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.


> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA, xlim=xlim, ylim=ylim, main="Demonstration of dia functions", axes=FALSE, xlab="", ylab="")
> ul <- dia.rect(1,9, labels="upper left", xlim=xlim, ylim=ylim)
> ll <- dia.rect(1,3, labels="lower left", xlim=xlim, ylim=ylim)
> lr <- dia.ellipse(9,3, "lower right", xlim=xlim, ylim=ylim)
> ur <- dia.ellipse(9,9, "upper right", xlim=xlim, ylim=ylim)
> ml <- dia.ellipse(3,6, "middle left", xlim=xlim, ylim=ylim)
> mr <- dia.ellipse(7,6, "middle right", xlim=xlim, ylim=ylim)
> bl <- dia.ellipse(1,1, "bottom left", xlim=xlim, ylim=ylim)
> br <- dia.rect(9,1, "bottom right", xlim=xlim, ylim=ylim)
> dia.arrow(from=lr, to=ul, labels="right to left")
> dia.arrow(from=ul, to=ur, labels="left to right")
> dia.curved.arrow(from=lr, to=ll$right, labels="right to left")
> dia.curved.arrow(to=ur, from=ul$right, labels="left to right")
> dia.curve(ll$top, ul$bottom, "double")  # for rectangles, specify where to point
> dia.curved.arrow(mr, ur, "up")          # but for ellipses, just point to it
> dia.curve(ml, mr, "across")
> dia.arrow(ur, lr, "top down")
> dia.curved.arrow(br$top, lr$bottom, "up")
> dia.curved.arrow(bl, br, "left to right")
> dia.arrow(bl, ll$bottom)
> dia.curved.arrow(ml, ll$right)
> dia.curved.arrow(mr, lr$top)

[Figure 23 here: "Demonstration of dia functions", labeled rectangles and ellipses (upper left, lower left, lower right, upper right, middle left, middle right, bottom left, bottom right) connected by labeled straight and curved arrows.]

Figure 23: The basic graphic capabilities of the dia functions are shown in this figure.

12 Miscellaneous functions

A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.

block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score

geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headtail. Combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r, and estimates effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also setCor.)

rangeCorrection will correct correlations for restriction of range

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.


superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys or when forming example problems.
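A minimal sketch (the two matrices are hypothetical keys matrices):

A <- matrix(1, 2, 2)   # keys for the first test
B <- matrix(2, 3, 3)   # keys for the second test
superMatrix(A, B)      # 5 x 5: A on the top left, B on the lower right, 0s elsewhere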

13 Data sets

A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as McDonald (1999). The original data are from Thurstone and Thurstone (1941) and were reanalyzed by Bechtoldt (1961). Personality item data represent five personality factors on 25 items (bfi), 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.

Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.


iq 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton heights. peas is the data set Francis Galton used to introduce the correlation coefficient with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1994), this data set allows for examples of basic scaling techniques.

14 Development version and a users guide

The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered, ones). To download that version, go to the repository http://personality-project.org/r/src/contrib and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.

News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news,

> news(Version > "1.2.8", package="psych")


15 Psychometric Theory

The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book), An introduction to Psychometric Theory with Applications in R (Revelle, in prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular and R in general, consult http://personality-project.org/r/r.guide.html, A short guide to R.

16 SessionInfo

This document was prepared using the following settings

> sessionInfo()

R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] psych_1.6.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4           lattice_0.20-33       matrixcalc_1.0-3      MASS_7.3-45
 [5] grid_3.3.0            arm_1.8-6             nlme_3.1-127          stats4_3.3.0
 [9] coda_0.18-1           minqa_1.2.4           mi_1.0                nloptr_1.0.4
[13] Matrix_1.2-6          boot_1.3-18           splines_3.3.0         lme4_1.1-12
[17] sem_3.1-7             tools_3.3.0           abind_1.4-3           GPArotation_2014.11-1
[21] parallel_3.3.0        mnormt_1.5-4


References

Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405–432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439–458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447–473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel Modeling in R (2.3): A Brief Introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245–276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78–98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297–334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173–178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England. 122 pp.

Fox, J., Nie, Z., and Byrnes, J. (2013). sem: Structural Equation Models. R package version 3.1-3.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430–450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255–282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121–132.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41–54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283–300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4):537–549.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231–258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309–317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153–175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676–1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481–495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill, New York, 3rd edition.

Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57–74.

Revelle, W. (2017). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. http://cran.r-project.org/web/packages/psych. R package version 1.7.3.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39–73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403–414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Individual differences in cognition: New methods for examining the personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27–49. Springer, New York, NY.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega, and the glb: comments on Sijtsma. Psychometrika, 74(1):145–154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83–90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. A Series of books in biology. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. A Series of books in biology. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345–353.

Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components – an alternative to "mathematical factors". Psychological Review, 42(5):425–454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321–327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123–133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121–144.

Page 9: How To: Use the psych package for Factor Analysis and data

were just empty cells then the data should be read in as a tab delimited or by using thereadclipboardtab function

R codemydata lt- readclipboard(sep=t) define the tab option ormytabdata lt- readclipboardtab() just use the alternative function

For the case of data in fixed width fields (some old data sets tend to have this format)copy to the clipboard and then specify the width of each field (in the example below thefirst variable is 5 columns the second is 2 columns the next 5 are 1 column the last 4 are3 columns)

R codemydata lt- readclipboardfwf(widths=c(52rep(15)rep(34))

43 Basic descriptive statistics

Once the data are read in then describe will provide basic descriptive statistics arrangedin a data frame format Consider the data set satact which includes data from 700 webbased participants on 3 demographic variables and 3 ability measures

describe reports means standard deviations medians min max range skew kurtosisand standard errors for integer or real data Non-numeric data although the statisticsare meaningless will be treated as if numeric (based upon the categorical coding ofthe data) and will be flagged with an

It is very important to describe your data before you continue on doing more complicatedmultivariate statistics The problem of outliers and bad data can not be overempha-sized

gt library(psych)gt data(satact)gt describe(satact) basic descriptive statistics

vars n mean sd median trimmed mad min max range skew kurtosis segender 1 700 165 048 2 168 000 1 2 1 -061 -162 002education 2 700 316 143 3 331 148 0 5 5 -068 -007 005age 3 700 2559 950 22 2386 593 13 65 52 164 242 036ACT 4 700 2855 482 29 2884 445 3 36 33 -066 053 018SATV 5 700 61223 11290 620 61945 11861 200 800 600 -064 033 427SATQ 6 687 61022 11564 620 61725 11861 200 800 600 -059 -002 441

44 Simple descriptive graphics

Graphic descriptions of data are very helpful both for understanding the data as well ascommunicating important results Scatter Plot Matrices (SPLOMS) using the pairspanels

9

function are useful ways to look for strange effects involving outliers and non-linearitieserrorbarsby will show group means with 95 confidence boundaries

441 Scatter Plot Matrices

Scatter Plot Matrices (SPLOMS) are very useful for describing the data The pairspanelsfunction adapted from the help menu for the pairs function produces xy scatter plots ofeach pair of variables below the diagonal shows the histogram of each variable on thediagonal and shows the lowess locally fit regression line as well An ellipse around themean with the axis length reflecting one standard deviation of the x and y variables is alsodrawn The x axis in each scatter plot represents the column variable the y axis the rowvariable (Figure 1) When plotting many subjects it is both faster and cleaner to set theplot character (pch) to be rsquorsquo (See Figure 1 for an example)

pairspanels will show the pairwise scatter plots of all the variables as well as his-tograms locally smoothed regressions and the Pearson correlation When plottingmany data points (as in the case of the satact data it is possible to specify that theplot character is a period to get a somewhat cleaner graphic

442 Correlational structure

There are many ways to display correlations Tabular displays are probably the mostcommon The output from the cor function in core R is a rectangular matrix lowerMat

will round this to (2) digits and then display as a lower off diagonal matrix lowerCor

calls cor with use=lsquopairwisersquo method=lsquopearsonrsquo as default values and returns (invisibly)the full correlation matrix and displays the lower off diagonal matrix

gt lowerCor(satact)

gendr edctn age ACT SATV SATQgender 100education 009 100age -002 055 100ACT -004 015 011 100SATV -002 005 -004 056 100SATQ -017 003 -003 059 064 100

When comparing results from two different groups it is convenient to display them as onematrix with the results from one group below the diagonal and the other group above thediagonal Use lowerUpper to do this

gt female lt- subset(satactsatact$gender==2)gt male lt- subset(satactsatact$gender==1)gt lower lt- lowerCor(male[-1])

10

gt png( pairspanelspng )gt pairspanels(satactpch=)gt devoff()null device

1

Figure 1 Using the pairspanels function to graphically show relationships The x axisin each scatter plot represents the column variable the y axis the row variable Note theextreme outlier for the ACT The plot character was set to a period (pch=rsquorsquo) in order tomake a cleaner graph

11

edctn age ACT SATV SATQeducation 100age 061 100ACT 016 015 100SATV 002 -006 061 100SATQ 008 004 060 068 100

gt upper lt- lowerCor(female[-1])

edctn age ACT SATV SATQeducation 100age 052 100ACT 016 008 100SATV 007 -003 053 100SATQ 003 -009 058 063 100

gt both lt- lowerUpper(lowerupper)gt round(both2)

education age ACT SATV SATQeducation NA 052 016 007 003age 061 NA 008 -003 -009ACT 016 015 NA 053 058SATV 002 -006 061 NA 063SATQ 008 004 060 068 NA

It is also possible to compare two matrices by taking their differences and displaying one (be-low the diagonal) and the difference of the second from the first above the diagonal

gt diffs lt- lowerUpper(lowerupperdiff=TRUE)gt round(diffs2)

education age ACT SATV SATQeducation NA 009 000 -005 005age 061 NA 007 -003 013ACT 016 015 NA 008 002SATV 002 -006 061 NA 005SATQ 008 004 060 068 NA

443 Heatmap displays of correlational structure

Perhaps a better way to see the structure in a correlation matrix is to display a heat mapof the correlations This is just a matrix color coded to represent the magnitude of thecorrelation This is useful when considering the number of factors in a data set Considerthe Thurstone data set which has a clear 3 factor solution (Figure 2) or a simulated dataset of 24 variables with a circumplex structure (Figure 3) The color coding represents aldquoheat maprdquo of the correlations with darker shades of red representing stronger negativeand darker shades of blue stronger positive correlations As an option the value of thecorrelation can be shown

12

gt png(corplotpng)gt corplot(Thurstonenumbers=TRUEmain=9 cognitive variables from Thurstone)gt devoff()null device

1

Figure 2 The structure of correlation matrix can be seen more clearly if the variables aregrouped by factor and then the correlations are shown by color By using the rsquonumbersrsquooption the values are displayed as well

13

gt png(circplotpng)gt circ lt- simcirc(24)gt rcirc lt- cor(circ)gt corplot(rcircmain=24 variables in a circumplex)gt devoff()null device

1

Figure 3 Using the corplot function to show the correlations in a circumplex Correlationsare highest near the diagonal diminish to zero further from the diagonal and the increaseagain towards the corners of the matrix Circumplex structures are common in the studyof affect

14

45 Polychoric tetrachoric polyserial and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient If thedata eg ability items are thought to represent an underlying continuous although latentvariable the φ will underestimate the value of the Pearson applied to these latent variablesOne solution to this problem is to use the tetrachoric correlation which is based uponthe assumption of a bivariate normal distribution that has been cut at certain points Thedrawtetra function demonstrates the process A simple generalization of this to the caseof the multiple cuts is the polychoric correlation

Other estimated correlations based upon the assumption of bivariate normality with cutpoints include the biserial and polyserial correlation

If the data are a mix of continuous polytomous and dichotomous variables the mixedcor

function will calculate the appropriate mixture of Pearson polychoric tetrachoric biserialand polyserial correlations

The correlation matrix resulting from a number of tetrachoric or polychoric correlationmatrix sometimes will not be positive semi-definite This will also happen if the correlationmatrix is formed by using pair-wise deletion of cases The corsmooth function will adjustthe smallest eigen values of the correlation matrix to make them positive rescale all ofthem to sum to the number of variables and produce a ldquosmoothedrdquo correlation matrix Anexample of this problem is a data set of burt which probably had a typo in the originalcorrelation matrix Smoothing the matrix corrects this problem

5 Item and scale analysis

The main functions in the psych package are for analyzing the structure of items and ofscales and for finding various estimates of scale reliability These may be considered asproblems of dimension reduction (eg factor analysis cluster analysis principal compo-nents analysis) and of forming and estimating the reliability of the resulting compositescales

51 Dimension reduction through factor analysis and cluster analysis

Parsimony of description has been a goal of science since at least the famous dictumcommonly attributed to William of Ockham to not multiply entities beyond necessity1 Thegoal for parsimony is seen in psychometrics as an attempt either to describe (components)

1Although probably neither original with Ockham nor directly stated by him (Thorburn 1918) Ock-hamrsquos razor remains a fundamental principal of science

15

or to explain (factors) the relationships between many observed variables in terms of amore limited set of components or latent factors

The typical data matrix represents multiple items or scales usually thought to reflect fewerunderlying constructs2 At the most simple a set of items can be be thought to representa random sample from one underlying domain or perhaps a small set of domains Thequestion for the psychometrician is how many domains are represented and how well doeseach item represent the domains Solutions to this problem are examples of factor analysis(FA) principal components analysis (PCA) and cluster analysis (CA) All of these pro-cedures aim to reduce the complexity of the observed data In the case of FA the goal isto identify fewer underlying constructs to explain the observed data In the case of PCAthe goal can be mere data reduction but the interpretation of components is frequentlydone in terms similar to those used when describing the latent variables estimated by FACluster analytic techniques although usually used to partition the subject space ratherthan the variable space can also be used to group variables to reduce the complexity ofthe data by forming fewer and more homogeneous sets of tests or items

At the data level the data reduction problem may be solved as a Singular Value Decom-position of the original matrix although the more typical solution is to find either theprincipal components or factors of the covariance or correlation matrices Given the pat-tern of regression weights from the variables to the components or from the factors to thevariables it is then possible to find (for components) individual component or cluster scoresor estimate (for factors) factor scores

Several of the functions in psych address the problem of data reduction

fa incorporates five alternative algorithms minres factor analysis principal axis factoranalysis weighted least squares factor analysis generalized least squares factor anal-ysis and maximum likelihood factor analysis That is it includes the functionality ofthree other functions that will be eventually phased out

principal Principal Components Analysis reports the largest n eigen vectors rescaled bythe square root of their eigen values

factorcongruence The congruence between two factors is the cosine of the angle betweenthem This is just the cross products of the loadings divided by the sum of the squaredloadings This differs from the correlation coefficient in that the mean loading is notsubtracted before taking the products factorcongruence will find the cosinesbetween two (or more) sets of factor loadings

vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ² estimate.

²Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.

fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.

fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

nfactors A number of different tests for the number of factors problem are run.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.

5.1.1 Minimum Residual Factor Analysis

The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix $_nR_n$ be approximated by the product of a factor matrix $_nF_k$ and its transpose plus a diagonal matrix of uniquenesses?

$$R = FF' + U^2 \tag{1}$$

The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls", and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.
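For example, the same call can be repeated, changing only the fm specification (a minimal sketch using the Thurstone correlation matrix included in psych):

> f.min <- fa(Thurstone, 3, n.obs = 213)               # fm="minres" is the default
> f.pa  <- fa(Thurstone, 3, n.obs = 213, fm = "pa")    # principal axis
> f.wls <- fa(Thurstone, 3, n.obs = 213, fm = "wls")   # weighted least squares
> f.gls <- fa(Thurstone, 3, n.obs = 213, fm = "gls")   # generalized least squares
> f.ml  <- fa(Thurstone, 3, n.obs = 213, fm = "ml")    # maximum likelihood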

factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.

A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 1). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ", and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation, although prior versions defaulted to varimax. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).

5.1.2 Principal Axis Factor Analysis

An alternative least squares algorithm, factor.pa (incorporated into fa as an option, fm="pa"), does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter parameter to 1 produces the SAS solution.
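A minimal sketch (assuming the max.iter argument of fa controls the number of iterations):

> f.pa1 <- fa(Thurstone, 3, n.obs = 213, fm = "pa", max.iter = 1)   # one iteration reproduces the SAS solution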

The solutions from the fa, the factor.minres, and factor.pa, as well as the principal functions, can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations are varimax and quartimax. Oblique transformations include oblimin, quartimin, and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.

Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 2).

5.1.3 Weighted Least Squares Factor Analysis

Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller

Table 1: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.

> f3t <- fa(Thurstone,3,n.obs=213)
> f3t
Factor Analysis using method =  minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  MR1   MR2   MR3   h2   u2 com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.0
First.Letters    0.00  0.86  0.00 0.73 0.27 1.0
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

 With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity =  1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90 % confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63


Table 2: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.

> f3 <- fa(Thurstone,3,n.obs = 213,fm="pa")
> f3o <- target.rot(f3)
> f3o
Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix
                  PA1   PA2   PA3   h2   u2
Sentences        0.89 -0.03  0.07 0.81 0.19
Vocabulary       0.89  0.07  0.00 0.80 0.20
Sent.Completion  0.83  0.04  0.03 0.70 0.30
First.Letters   -0.02  0.85 -0.01 0.73 0.27
4.Letter.Words  -0.05  0.74  0.09 0.57 0.43
Suffixes         0.17  0.63 -0.09 0.43 0.57
Letter.Series   -0.06 -0.08  0.84 0.69 0.31
Pedigrees        0.33 -0.09  0.48 0.37 0.63
Letter.Group    -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00

overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 3).

The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 4). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 4) or by a structure diagram (Figure 5).

> plot(f3t)

[Figure 4 here: an item by factor SPLOM of the MR1, MR2, and MR3 loadings, titled "Factor Analysis"; each axis runs from 0.0 to 0.8.]

Figure 4: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.

A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution, and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but

Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.

> f3w <- fa(Thurstone,3,n.obs = 213,fm="wls")
> print(f3w,cut=0,digits=3)
Factor Analysis using method =  wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix
                  WLS1   WLS2   WLS3    h2    u2  com
Sentences        0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary       0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion  0.833  0.034  0.007 0.735 0.265 1.00
First.Letters   -0.002  0.855  0.003 0.731 0.269 1.00
4.Letter.Words  -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes         0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series    0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees        0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group    -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

 With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity =  1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90 % confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627

> fa.diagram(f3t)

[Figure 5 here: a path diagram titled "Factor Analysis" showing MR1, MR2, and MR3, their loadings of 0.5 to 0.9 on the nine Thurstone variables, and the correlations of 0.5 to 0.6 among the factors.]

Figure 5: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.

that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.

5.1.4 Principal Components analysis (PCA)

An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Psychologists typically use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 4 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. (Factor loadings and PCA loadings converge as the number of items increases; factor loadings may be thought of as the asymptotic principal component loadings as the number of items on the component grows towards infinity.) Also note how the goodness of fit statistics, based upon the residual off diagonal elements, are much worse than for the fa solution.

5.1.5 Hierarchical and bi-factor solutions

For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).

Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 5) can themselves be factored. This results in a higher order factor model (Figure 6). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 7). Yet

Table 4: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 2. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.

> p3p <- principal(Thurstone,3,n.obs = 213,rotate="Promax")
> p3p
Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  RC1   RC2   RC3   h2   u2 com
Sentences        0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary       0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion  0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters    0.01  0.84  0.07 0.78 0.22 1.0
4.Letter.Words  -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes         0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series    0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees        0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group    -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

 With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity =  1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
 with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98

another solution is to use structural equation modeling to directly solve for the general and group factors.

> om.h <- omega(Thurstone,n.obs=213,sl=FALSE)
> op <- par(mfrow=c(1,1))

[Figure 6 here: an omega path diagram in which g loads 0.7 to 0.8 on three group factors (F1, F2, F3), which in turn load 0.2 to 0.9 on the nine Thurstone variables.]

Figure 6: A higher order factor solution to the Thurstone 9 variable problem.

Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011). Unfortunately, this rotation will frequently achieve a local rather than global optimum, and it is recommended to try multiple starting configurations.
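A minimal sketch of this approach (assuming the "bifactor" rotation option of fa; with nine variables, four factors allow one general and three group factors):

> f.bi <- fa(Thurstone, 4, n.obs = 213, rotate = "bifactor")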


> om <- omega(Thurstone,n.obs=213)

[Figure 7 here: a bifactor (Schmid-Leiman) path diagram in which g loads 0.5 to 0.7 directly on all nine variables, with residualized group factors F1, F2, and F3 loading 0.2 to 0.6 on their items.]

Figure 7: A bifactor factor solution to the Thurstone 9 variable problem.

5.1.6 Item Cluster Analysis: iclust

An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) review why the iclust algorithm is particularly appropriate for scale construction in marketing.

Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust, and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):

1. Find the proximity (e.g. correlation) matrix.

2. Identify the most similar pair of items.

3. Combine this most similar pair of items to form a new variable (cluster).

4. Find the similarity of this cluster to all other items and clusters.

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains, or in iclust if there is a failure to increase reliability coefficients α or β).

6. Purify the solution by reassigning items to the most similar cluster center.

iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori, or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 8).
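For example (a minimal sketch; the nclusters argument fixes the number of clusters in advance rather than letting the β criterion decide):

> ic5 <- iclust(bfi[1:25], nclusters = 5)   # force a five cluster solution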


Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 7). The pattern matrix is available as an object in the results.

> data(bfi)
> ic <- iclust(bfi[1:25])

[Figure 8 here: the ICLUST tree for the 25 personality items, showing how items combine into clusters (C1 to C21) with the α and β of each cluster; four final clusters emerge, C20 (α = 0.81, β = 0.63), C16 (α = 0.81, β = 0.76), C15 (α = 0.73, β = 0.67), and C21 (α = 0.41, β = 0.27).]

Figure 8: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.

The previous analysis (Figure 8) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 9). Note that finding the polychoric correlations takes some time the first time, but the next three analyses were done using that correlation matrix (rpoly$rho). When using the console for input, polychoric will report on its progress while working using

Table 5: The summary statistics from an iclust analysis shows three large clusters and one smaller cluster.

> summary(ic)  # show the results
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*:
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations, reliabilities on diagonal,
correlations corrected for attenuation above diagonal:
       C20    C16   C15   C21
C20   0.80 -0.291  0.40 -0.33
C16  -0.24  0.815 -0.29  0.11
C15   0.30 -0.221  0.73 -0.30
C21  -0.23  0.074 -0.20  0.61

progressBar. polychoric and tetrachoric, when running on OSX, will take advantage of the multiple cores on the machine and run somewhat faster.

Table 6: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.

> data(bfi)
> rpoly <- polychoric(bfi[1:25])  # the ... indicate the progress of the function

A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.

5.2 Confidence intervals using bootstrapping techniques

Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ² and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (See Table 8). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.
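A minimal sketch of both cases (the cor = "poly" option, which tells fa to first find polychoric correlations, is an assumption about the current argument name):

> f2.boot  <- fa(bfi[1:10], 2, n.iter = 20)                 # bootstrap the Pearson based loadings
> f2p.boot <- fa(bfi[1:10], 2, n.iter = 20, cor = "poly")   # the same for polychoric correlations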

5.3 Comparing factor/component/cluster solutions

Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of the congruence coefficient, which is just the cosine of the angle between the dimensions:

$$c_{f_i f_j} = \frac{\sum_{k=1}^{n} f_{ik} f_{jk}}{\sqrt{\sum f_{ik}^2 \sum f_{jk}^2}}.$$

Consider the case of a four factor solution and a four cluster solution to the Big Five problem.

> f4 <- fa(bfi[1:25],4,fm="pa")
> factor.congruence(f4,ic)
      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12

> ic.poly <- iclust(rpoly$rho,title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)

[Figure 9 here: the ICLUST tree for the 25 BFI items based on polychoric correlations, ending in a single cluster C23 (α = 0.83, β = 0.58).]

Figure 9: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 8), which was done using Pearson correlations.

> ic.poly <- iclust(rpoly$rho,5,title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)

[Figure 10 here: the ICLUST tree based on polychoric correlations with the solution forced to 5 clusters (C20, C19, C16, C15, and C7), with O4 left unclustered.]

Figure 10: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 9), which was done without specifying the number of clusters, and to the next one (Figure 11), which was done by changing the beta criterion.

> ic.poly <- iclust(rpoly$rho,beta.size=3,title="ICLUST, beta.size=3")

[Figure 11 here: the ICLUST tree based on polychoric correlations with the beta.size criterion set to 3, again ending in a single cluster C23 (α = 0.83, β = 0.58).]

Figure 11: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 8, 9, 10).

Table 7: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.0.

> print(ic,cut=.3)
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
     O   P    C20   C16   C15   C21
A1  C20 C20
A2  C20 C20  0.59
A3  C20 C20  0.65
A4  C20 C20  0.43
A5  C20 C20  0.65
C1  C15 C15              0.54
C2  C15 C15              0.62
C3  C15 C15              0.54
C4  C15 C15        0.31 -0.66
C5  C15 C15 -0.30  0.36 -0.59
E1  C20 C20 -0.50
E2  C20 C20 -0.61  0.34
E3  C20 C20  0.59             -0.39
E4  C20 C20  0.66
E5  C20 C20  0.50        0.40 -0.32
N1  C16 C16        0.76
N2  C16 C16        0.75
N3  C16 C16        0.74
N4  C16 C16 -0.34  0.62
N5  C16 C16        0.55
O1  C20 C21                   -0.53
O2  C21 C21                    0.44
O3  C20 C21  0.39             -0.62
O4  C21 C21                   -0.33
O5  C21 C21                    0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations, reliabilities on diagonal,
correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit =  0.68   Pattern fit =  0.96   RMSR =  0.05
NULL

Table 8: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.

> fa(bfi[1:10],2,n.iter=20)
Factor Analysis with confidence intervals using method =  fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method =  minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

 With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity =  1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with MLE Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.078 and the 90 % confidence intervals are 0.071 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

 Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.46 -0.40 -0.36
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.06 -0.03  0.00  0.74  0.77  0.80
A4  0.10  0.15  0.18  0.41  0.44  0.48
A5 -0.02  0.02  0.07  0.58  0.62  0.66
C1  0.52  0.57  0.60 -0.08 -0.05 -0.02
C2  0.59  0.62  0.66 -0.04 -0.01  0.02
C3  0.50  0.54  0.57 -0.01  0.03  0.08
C4 -0.70 -0.66 -0.62 -0.02  0.01  0.03
C5 -0.62 -0.57 -0.53 -0.08 -0.05 -0.01

 Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.28     0.32  0.37
>

PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90

A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor, and component solutions to the Thurstone data set is done using the factor.congruence function (See Table 9).

Table 9: Congruence coefficients for oblique factor, bifactor, and component solutions for the Thurstone problem.

> factor.congruence(list(f3t,f3o,om,p3p))
      MR1  MR2  MR3   PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2  RC3
MR1  1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
MR2  0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
MR3  0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
PA1  1.00 0.03 0.01  1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2  0.06 1.00 0.01  0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99 0.05
PA3  0.13 0.06 1.00  0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01 0.99
g    0.72 0.60 0.52  0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58 0.50
F1   1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
F2   0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
F3   0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
h2   0.74 0.57 0.51  0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56 0.49
RC1  1.00 0.04 0.06  1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06 0.00
RC2  0.08 0.99 0.02  0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00 0.05
RC3  0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05 1.00

5.4 Determining the number of dimensions to extract

How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).

Techniques most commonly used include:

1) Extracting factors until the chi square of the residual matrix is not significant.

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis) fa.parallel (Horn, 1965).

4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values, analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).

5) Extracting factors as long as they are interpretable.

6) Using the Very Simple Structure Criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.

Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size, in that for large samples (N > 1,000) all of the eigen values of random factors will be very close to 1. The scree test is quite appealing, but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigator's creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.
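Several of these criteria may be examined in a single call to the nfactors function (a minimal sketch; n sets the maximum number of factors to consider):

> nf <- nfactors(bfi[1:25], n = 8)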

An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix, after the factors of interest are extracted, is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest that nevertheless account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.

5.4.1 Very Simple Structure

The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Absolute Partial correlation) of Velicer (1976).

Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 12). However, the MAP criterion suggests that 5 is optimal.

> vss <- vss(bfi[1:25],title="Very Simple Structure of a Big 5 inventory")

[Figure 12 here: Very Simple Structure fit (y axis, 0.0 to 1.0) plotted against the number of factors (1 to 8), with separate curves for complexity 1, 2, 3, and 4.]

Figure 12: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.

> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximum of 0.58 with 4 factors
VSS complexity 2 achieves a maximum of 0.74 with 4 factors

The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit RMSEA  BIC   SABIC complex eChisq  SRMR eCRMS  eBIC
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.123 9648 10522.1     1.0  23881 0.119 0.125 21698
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.100 5287  6084.5     1.2  12432 0.086 0.094 10440
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.087 3200  3924.3     1.3   7232 0.066 0.075  5422
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.074 1731  2385.1     1.4   3750 0.047 0.057  2115
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.055  281   869.3     1.6   1495 0.030 0.038    27
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.043 -296   228.5     1.7    670 0.020 0.027  -639
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.037 -463     1.2     1.9    448 0.016 0.023  -711
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.032 -524  -117.6     1.9    289 0.013 0.020  -727

5.4.2 Parallel Analysis

An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 13). It is interesting to compare fa.parallel with the paran function from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial, minor (but not zero) factors, show that using smcs will tend to lead to too many factors.

A more tedious problem, in terms of computation, is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly, or by fa.parallel with the cor option set to "poly". By default, the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
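A minimal sketch of the polychoric case:

> fa.parallel(bfi[1:25], cor = "poly", n.iter = 20)   # parallel analysis of polychoric correlations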

5.5 Factor extension

Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created

> fa.parallel(bfi[1:25],main="Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors =  6  and the number of components =  6

[Figure 13 here: eigen values of principal components and factor analysis (y axis, 0 to 5) plotted against factor/component number (1 to 25), with curves for PC and FA solutions of the actual, simulated, and resampled data.]

Figure 13: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.

to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (See Figure 14).

Another way to examine the overlap between two sets is the use of set correlation found by set.cor (discussed later).

6 Classical Test Theory and Reliability

Surprisingly, 112 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).

α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

$$\lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(\vec{V}_x)}{V_x}\right) = \frac{n}{n-1}\,\frac{V_x - tr(\vec{V}_x)}{V_x} = \alpha$$

Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. For τ equivalent items, alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy"), coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.
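The full distribution of split half reliabilities, and thus both the best and the worst (β) splits, may be examined directly (a minimal sketch, assuming the splitHalf function and the built-in attitude data):

> splitHalf(attitude)   # reports the greatest, average, and worst split half reliability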

Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, e²ⱼ, and is

$$\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum(1 - r_{smc}^2)}{V_x}.$$

The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.
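For example (a minimal sketch):

> round(smc(bfi[1:10]), 2)   # squared multiple correlation of each item with the other nine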

G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability

> v16 <- sim.item(16)
> s <- c(1,3,5,7,9,11,13,15)
> f2 <- fa(v16[,s],2)
> fe <- fa.extension(cor(v16)[s,-s],f2)
> fa.diagram(f2,fe=fe)

[Figure 14 here: a diagram titled "Factor analysis and extension" in which MR1 and MR2 are defined by the eight odd-numbered variables on the left (loadings of about ±0.5 to ±0.6) and extended to the eight even-numbered variables on the right.]

Figure 14: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.

the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.

Alpha, G6, and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.

More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.

Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.

Functions for examining the reliability of a single scale or a set of scales include:

alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item - whole correlations, α if an item is omitted, and item means and standard deviations.

guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zergers (1978) (μ0 … μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).

omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)

cluster.cor Given a n x c cluster definition matrix of -1s, 0s, and 1s (the keys), and a n x n correlation matrix, find the correlations of the composite clusters.

scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of item scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha, coefficient G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items. (See the sketch at the end of this list.)

score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds the total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.
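As a minimal sketch of scoring two such scales from the bfi data (the keys.list shown here is an illustrative assumption; a minus sign reverse scores an item):

> keys.list <- list(agree = c("-A1", "A2", "A3", "A4", "A5"),
+                   conscientious = c("C1", "C2", "C3", "-C4", "-C5"))
> keys <- make.keys(bfi, keys.list)   # convert the named list to a -1/0/1 keys matrix
> scores <- scoreItems(keys, bfi)     # scale scores, alpha, G6*, and scale intercorrelations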

6.1 Reliability of a single scale

A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, and McDonald's ωh and ωt. Consider a simulated data set, representing 9 items with a hierarchical structure and the following correlation matrix. Then, using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as the α if one item is dropped at a time.

> set.seed(17)
> r9 <- sim.hierarchical(n=500,raw=TRUE)$observed
> round(cor(r9),2)

     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00

> alpha(r9)

Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
  0.77   0.8  0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics:
     n rawr std.r r.cor r.drop    mean   sd
V1 500 0.77  0.76  0.76   0.67  0.0327 1.02
V2 500 0.67  0.66  0.62   0.55  0.1071 1.00
V3 500 0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500 0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500 0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500 0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500 0.60  0.60  0.52   0.46  0.0481 1.01
V8 500 0.58  0.57  0.49   0.43  0.0526 1.07
V9 500 0.47  0.48  0.36   0.32  0.0164 0.97

Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that item 2 is to be scored (incorrectly) negatively.

> keys <- c(1,-1,1,1,1,1,1)
> alpha(attitude,keys)

Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
       0.43      0.52    0.75      0.14 1.1 0.13   58 5.5

 lower alpha upper     95% confidence boundaries
  0.18  0.43  0.68

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r  S/N alpha se
rating           0.32      0.44    0.67     0.114 0.77    0.170
complaints-      0.80      0.80    0.82     0.394 3.89    0.057
privileges       0.27      0.41    0.72     0.103 0.69    0.176
learning         0.14      0.31    0.64     0.069 0.44    0.207
raises           0.14      0.27    0.61     0.059 0.38    0.205
critical         0.36      0.47    0.76     0.130 0.90    0.139
advance          0.21      0.34    0.66     0.079 0.51    0.171

 Item statistics:
             n  rawr std.r r.cor r.drop mean   sd
rating      30  0.60  0.60  0.60   0.33   65 12.2
complaints- 30 -0.57 -0.58 -0.74  -0.74   50 13.3
privileges  30  0.66  0.65  0.54   0.42   53 12.2
learning    30  0.80  0.79  0.78   0.64   56 11.7
raises      30  0.82  0.83  0.85   0.69   65 10.4
critical    30  0.50  0.53  0.35   0.27   75  9.9
advance     30  0.74  0.75  0.71   0.58   43 10.3

Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if item 2 is dropped, then the reliability is improved substantially. This suggests that item 2 was incorrectly scored. Doing the analysis again with item 2 positively scored produces much more favorable results.

> keys <- c(1,1,1,1,1,1,1)
> alpha(attitude,keys)

Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
  0.76  0.84  0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics:
            n rawr std.r r.cor r.drop mean   sd
rating     30 0.78  0.76  0.75   0.67   65 12.2
complaints 30 0.84  0.81  0.82   0.74   67 13.3
privileges 30 0.70  0.68  0.60   0.56   53 12.2
learning   30 0.81  0.80  0.78   0.71   56 11.7
raises     30 0.85  0.86  0.85   0.79   65 10.4
critical   30 0.42  0.45  0.31   0.27   75  9.9
advance    30 0.60  0.62  0.56   0.46   43 10.3

It is useful, when considering items for a potential scale, to examine the item distribution. This is done in scoreItems as well as in alpha.

> items <- sim.congeneric(N=500,short=FALSE,low=-2,high=2,categorical=TRUE)  # 500 responses to 4 discrete items
> alpha(items$observed)  # item response analysis of congeneric measures

Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
  0.68  0.72  0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics:
     n rawr std.r r.cor r.drop  mean   sd
V1 500 0.81  0.81  0.74   0.62 0.058 0.97
V2 500 0.74  0.75  0.63   0.52 0.012 0.98
V3 500 0.72  0.72  0.57   0.48 0.056 0.99
V4 500 0.68  0.67  0.48   0.41 0.098 1.02

Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0

6.2 Using omega to find the reliability of a single scale

Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (See Figure 15), or omegaSem for a confirmatory analysis using the sem package (Fox et al., 2013), based upon the exploratory solution from omega.

McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) (http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf) compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009), http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf.)

One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.

ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal). For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items, it is frequently the case that some items are to be scored positively and some negatively. Although it is probably better to specify which directions the items are to be scored by specifying a key vector, if flip = TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)
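For example (a minimal sketch of two of these options, using the simulated r9 data from the previous section):

> om.pa <- omega(r9, fm = "pa")    # principal axes extraction
> om.ml <- omega(r9, fm = "mle")   # maximum likelihood, using factanal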

β, an alternative to ωh, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).

The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).

Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.

There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be √r_ab, where r_ab is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. 2007. To do this in omega, add the option="first" or option="second" to the call.
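For example (a minimal sketch):

> om2 <- omega(r9, nfactors = 2, option = "first")   # equate the general factor with the first group factor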

Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.

In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.

McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness, u², from factor analysis to find e²ⱼ. This is based on a decomposition of the variance of a test score, Vx, into four parts: that due to a general factor, g, that due to a set of group factors, f (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)

Letting

$$x = cg + Af + Ds + e$$

then the communality of item j, based upon general as well as group factors, is

$$h_j^2 = c_j^2 + \sum f_{ij}^2$$

and the unique variance for the item,

$$u_j^2 = \sigma_j^2 (1 - h_j^2),$$

may be used to estimate the test reliability. That is, if $h_j^2$ is the communality of item j, based upon general as well as group factors, then for standardized items, $e_j^2 = 1 - h_j^2$ and

$$\omega_t = \frac{\mathbf{1}cc'\mathbf{1} + \mathbf{1}AA'\mathbf{1}'}{V_x} = 1 - \frac{\sum(1 - h_j^2)}{V_x} = 1 - \frac{\sum u^2}{V_x}$$


Because $h_j^2 \ge r_{smc}^2$, $\omega_t \ge \lambda_6$.

It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor:

$$\omega_h = \frac{\mathbf{1}cc'\mathbf{1}}{V_x}$$

Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

$$\omega_{\inf} = \frac{\mathbf{1}cc'\mathbf{1}}{\mathbf{1}cc'\mathbf{1} + \mathbf{1}AA'\mathbf{1}'}$$

In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.

> om9

9 simulated variables 
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8 
G.6:                   0.8 
Omega Hierarchical:    0.69 
Omega H asymptotic:    0.82 
Omega Total            0.83 

Schmid Leiman Factor loadings greater than  0.2 
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g   F1   F2   F3 
2.48 0.52 0.47 0.36 

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13 
Explained Common Variance of the general factor =  0.65 

The degrees of freedom are 12  and the fit is  0.03 
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02 
The df corrected root mean square of the residuals is  0.03


> om9 <- omega(r9,title="9 simulated variables")

[bifactor diagram: the nine variables V1-V9 load on the general factor g and on three group factors F1, F2, and F3]

Figure 15: A bifactor solution for 9 simulated variables with a hierarchical structure.


RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32 
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08 
The df corrected root mean square of the residuals is  0.09 

RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97 

Measures of factor score adequacy 
                                                 g    F1    F2    F3
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

 Total, General and Subset omega for each subset
                                                 g   F1   F2   F3
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

6.3 Estimating ωh using Confirmatory Factor Analysis

The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.

> omegaSem(r9,n.obs=500)

Call: omegaSem(m = r9, n.obs = 500)
Omega 
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip, 
    digits = digits, title = title, sl = sl, labels = labels, 
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8 
G.6:                   0.8 
Omega Hierarchical:    0.69 
Omega H asymptotic:    0.82 
Omega Total            0.83 

Schmid Leiman Factor loadings greater than  0.2 
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g   F1   F2   F3 
2.48 0.52 0.47 0.36 

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13 
Explained Common Variance of the general factor =  0.65 

The degrees of freedom are 12  and the fit is  0.03 
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02 
The df corrected root mean square of the residuals is  0.03
RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32 
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08 
The df corrected root mean square of the residuals is  0.09 

RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97 

Measures of factor score adequacy 
                                                 g    F1    F2    F3
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

 Total, General and Subset omega for each subset
                                                 g   F1   F2   F3
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

 Omega Hierarchical from a confirmatory model using sem =  0.72
 Omega Total from a confirmatory model using sem =  0.83 
With loadings of 
      g   F1   F2   F3   h2   u2
V1 0.74 0.40           0.71 0.29
V2 0.58 0.36           0.47 0.53
V3 0.56 0.42           0.50 0.50
V4 0.57      0.45      0.53 0.47
V5 0.58      0.25      0.40 0.60
V6 0.43      0.26      0.26 0.74
V7 0.49           0.38 0.38 0.62
V8 0.44           0.45 0.39 0.61
V9 0.34           0.24 0.17 0.83

With eigenvalues of:
   g   F1   F2   F3 
2.60 0.47 0.40 0.33


6.3.1 Other estimates of reliability

Other estimates of reliability are found by the splitHalf function. These are described in more detail in Revelle and Zinbarg (2009). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.

> splitHalf(r9)

Split half reliabilities  
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) =  0.84
Guttman lambda 6                          =  0.8
Average split half reliability            =  0.79
Guttman lambda 3 (alpha)                  =  0.8
Minimum split half reliability  (beta)    =  0.72

6.4 Reliability and correlations of multiple scales within an inventory

A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness, and Openness). The data may either be the raw data (scoreItems) or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6.

6.4.1 Scoring from raw data

To score these five scales from the 25 items, use the scoreItems function with the helper function make.keys. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.
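
As a small worked example of the reverse scoring rule, for an item answered on a 1 to 6 scale a -1 weight scores the item as (max + min) - item = 7 - item:

> item <- c(1, 2, 6)   #raw responses on a 1-6 scale
> (6 + 1) - item       #reverse scored responses: 6 5 1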

There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged


to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.
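
A sketch of the alternative, assuming the keys matrix formed below; totals=TRUE is the only change from the default call:

> scores.total <- scoreItems(keys, bfi, totals=TRUE)  #report total scores rather than item means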

The printed output includes coefficients α and G6, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.

First, create the keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copying it into R). Although not normally necessary, show the keys to understand what is happening.

Note that the number of items to specify in the make.keys function is the total number of items in the inventory. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.

> keys <- make.keys(nvars=28,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+    Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+    Openness = c(21,-22,23,24,-25)),
+    item.labels=colnames(bfi))
> keys

          Agree Conscientious Extraversion Neuroticism Openness
A1           -1             0            0           0        0
A2            1             0            0           0        0
A3            1             0            0           0        0
A4            1             0            0           0        0
A5            1             0            0           0        0
C1            0             1            0           0        0
C2            0             1            0           0        0
C3            0             1            0           0        0
C4            0            -1            0           0        0
C5            0            -1            0           0        0
E1            0             0           -1           0        0
E2            0             0           -1           0        0
E3            0             0            1           0        0
E4            0             0            1           0        0
E5            0             0            1           0        0
N1            0             0            0           1        0
N2            0             0            0           1        0
N3            0             0            0           1        0
N4            0             0            0           1        0
N5            0             0            0           1        0
O1            0             0            0           0        1
O2            0             0            0           0       -1
O3            0             0            0           0        1
O4            0             0            0           0        1
O5            0             0            0           0       -1
gender        0             0            0           0        0
education     0             0            0           0        0
age           0             0            0           0        0

The use of multiple key matrices for different inventories is facilitated by using the superMatrix function to combine two or more matrices. This allows convenient scoring of large data sets combining multiple inventories with keys based upon each individual inventory. Pretend for the moment that the big 5 items were made up of two inventories, one consisting of the first 10 items, the second the last 18 items (15 personality items + 3 demographic items). Then the following code would work:

> keys.1 <- make.keys(10,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10)))
> keys.2 <- make.keys(15,list(Extraversion=c(-1,-2,3:5),Neuroticism=c(6:10),
+    Openness = c(11,-12,13,14,-15)))
> keys.25 <- superMatrix(list(keys.1,keys.2))

The resulting keys matrix is identical to that found above, except that it does not include the extra 3 demographic items. This is useful when scoring the raw items, because the response frequencies for each category are reported, something that is not wanted for the demographic data.

This use of making multiple key matrices and then combining them into one super matrix of keys is particularly useful when combining demographic information with items to be scored. A set of demographic keys can be made and then these can be combined with the keys for the particular scales.
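
A sketch of that idea (the names demo.keys and all.keys are illustrative, not from the package):

> demo.keys <- make.keys(3, list(gender=1, education=2, age=3))  #keys for the 3 demographic variables
> all.keys <- superMatrix(list(keys.25, demo.keys))              #join them to the personality keys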

Now use these keys in combination with the raw data to score the items, calculate basic reliability and intercorrelations, and find the item-by-scale correlations for each item and each scale. By default, missing data are replaced by the median for that variable.

> scores <- scoreItems(keys,bfi)
> scores

Call: scoreItems(keys = keys, items = bfi)

(Unstandardized) Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
alpha   0.7          0.72         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
ASE   0.014         0.014        0.013       0.011    0.017

Average item correlation:
          Agree Conscientious Extraversion Neuroticism Openness
average.r  0.32          0.34         0.39        0.46     0.23

Guttman 6 reliability:
         Agree Conscientious Extraversion Neuroticism Openness
Lambda.6   0.7          0.72         0.76        0.81      0.6

Signal/Noise based upon av.r:
             Agree Conscientious Extraversion Neuroticism Openness
Signal/Noise   2.3           2.6          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation 
raw correlations below the diagonal, alpha on the diagonal 
corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.70          0.36         0.63      -0.245     0.23
Conscientious  0.26          0.72         0.35      -0.305     0.30
Extraversion   0.46          0.26         0.76      -0.284     0.32
Neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
Openness       0.15          0.19         0.22      -0.086     0.60

In order to see the item by scale loadings and frequency counts of the data, print with the short option = FALSE.

To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores (see Figure 16).

6.4.2 Forming scales from a correlation matrix

There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.

Consider the same bfi data set, but first find the correlations, and then use cluster.cor:

> r.bfi <- cor(bfi,use="pairwise")
> scales <- cluster.cor(keys,r.bfi)
> summary(scales)

Call: cluster.cor(keys = keys, r.mat = r.bfi)

Scale intercorrelations corrected for attenuation 
raw correlations below the diagonal, (standardized) alpha on the diagonal 
corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.71          0.35         0.64      -0.242     0.25
Conscientious  0.25          0.73         0.36      -0.287     0.30
Extraversion   0.47          0.27         0.76      -0.278     0.35
Neuroticism   -0.18         -0.22        -0.22       0.815    -0.11
Openness       0.16          0.20         0.24      -0.074     0.61

To find the correlations of the items with each of the scales (the "structure" matrix) or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.
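
A sketch of the first approach, using the keys and r.bfi objects formed above (the printed object includes both the structure and pattern matrices):

> cl <- cluster.loadings(keys, r.bfi)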


> png('scores.png')
> pairs.panels(scores$scores,pch='.',jiggle=TRUE)
> dev.off()
quartz 
     2 

Figure 16: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.


6.5 Scoring Multiple Choice Items

Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice, where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).

> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys,iqitems)

Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

item statistics 
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39

> #just convert the items to true or false 
> iq.tf <- score.multiple.choice(iq.keys,iqitems,score=FALSE)
> describe(iq.tf)  #compare to previous results

          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01

Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
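
For instance, a sketch treating all 16 scored items as a single composite (iq.key is an illustrative name):

> iq.key <- make.keys(16, list(iq=1:16))
> iq.scale <- scoreItems(iq.key, iq.tf)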

6.6 Item analysis

Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or based upon correlation matrices using the setCor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.
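
A sketch of that sequence for the bfi items, using the keys matrix from section 6.4.1:

> describe(bfi[1:25])               #basic descriptive statistics
> fa.parallel(bfi[1:25])            #how many dimensions?
> f5 <- fa(bfi[1:25], nfactors=5)   #factor the items
> sc <- scoreItems(keys, bfi)       #score and examine the resulting scales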

Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait, or just the total score. This may be done by using irt.responses (Figure 17).

7 Item Response Theory analysis

The use of Item Response Theory is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 18).


> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys,iqitems,score=TRUE,short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  #set this to see the output for multiple items
> irt.responses(scores$scores,iqitems[1:4],breaks=11)

[four panels plot P(response) against theta for items reason.4, reason.16, reason.17, and reason.19, with one curve for each response alternative 0-6]

Figure 17: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.


7.1 Factor analysis and Item Response Theory

If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are of course just the factor loadings. Item difficulty, δj, and item discrimination, αj, may be found from factor analysis of the tetrachoric correlations, where λj is just the factor loading on the first factor and τj is the normal threshold reported by the tetrachoric function:

δj = Dτ/√(1 - λ²j)        αj = λj/√(1 - λ²j)        (2)

where D is a scaling factor used when converting to the parameterization of the logistic model, and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λj) and item thresholds (τj) are just

λj = αj/√(1 + α²j)        τj = δj/√(1 + α²j)
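
A numeric sketch of these conversions in the normal metric (D = 1), with illustrative values for the loading and threshold:

> lambda <- 0.6                        #a factor loading
> tau <- 0.5                           #a tetrachoric threshold
> alpha <- lambda/sqrt(1 - lambda^2)   #discrimination: 0.75
> delta <- tau/sqrt(1 - lambda^2)      #difficulty (location): 0.625
> alpha/sqrt(1 + alpha^2)              #converting back recovers lambda = 0.6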

Consider 9 dichotomous items representing one factor but differing in their levels of difficulty:

> set.seed(17)
> d9 <- sim.irt(9,1000,-2.5,2.5,mod="normal") #dichotomous items
> test <- irt.fa(d9$items)

> test

Item Response Analysis using Factor Analysis 

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis 

Summary information by factor and item
 Factor =  1 
               -3    -2    -1     0    1    2     3
V1           0.48  0.59  0.24  0.06 0.01 0.00  0.00
V2           0.30  0.68  0.45  0.12 0.02 0.00  0.00
V3           0.12  0.50  0.74  0.29 0.06 0.01  0.00
V4           0.05  0.26  0.74  0.55 0.14 0.03  0.00
V5           0.01  0.07  0.44  1.05 0.41 0.06  0.01
V6           0.00  0.03  0.15  0.60 0.74 0.24  0.04
V7           0.00  0.01  0.04  0.22 0.73 0.63  0.16
V8           0.00  0.00  0.02  0.12 0.45 0.69  0.31
V9           0.00  0.01  0.02  0.08 0.25 0.47  0.36
Test Info    0.98  2.14  2.85  3.09 2.81 2.14  0.89
SEM          1.01  0.68  0.59  0.57 0.60 0.68  1.06
Reliability -0.02  0.53  0.65  0.68 0.64 0.53 -0.12


Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27  and the objective function was  1.2 
The number of observations was  1000  with Chi Square =  1195.58  with prob <  7.3e-235 

The root mean square of the residuals (RMSA) is  0.09 
The df corrected root mean square of the residuals is  0.1 

Tucker Lewis Index of factoring reliability =  0.699
RMSEA index =  0.209  and the 90 % confidence intervals are  0.198 0.218
BIC =  1009.07

Similar analyses can be done for polytomous items, such as those of the bfi extraversion scale:

> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt

Item Response Analysis using Factor Analysis 

Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis 

Summary information by factor and item
 Factor =  1 
              -3    -2    -1     0    1    2    3
E1          0.37  0.58  0.75  0.76 0.61 0.39 0.21
E2          0.48  0.89  1.18  1.24 0.99 0.59 0.25
E3          0.34  0.49  0.58  0.57 0.49 0.36 0.23
E4          0.63  0.98  1.11  0.96 0.67 0.37 0.16
E5          0.33  0.42  0.47  0.45 0.37 0.27 0.17
Test Info   2.15  3.36  4.09  3.98 3.13 1.97 1.02
SEM         0.68  0.55  0.49  0.50 0.56 0.71 0.99
Reliability 0.53  0.70  0.76  0.75 0.68 0.49 0.02

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5  and the objective function was  0.05 
The number of observations was  2800  with Chi Square =  135.92  with prob <  1.3e-27 

The root mean square of the residuals (RMSA) is  0.04 
The df corrected root mean square of the residuals is  0.06 

Tucker Lewis Index of factoring reliability =  0.932
RMSEA index =  0.097  and the 90 % confidence intervals are  0.083 0.111
BIC =  96.23

The item information functions show that not all of the items are equally good (Figure 19):

These procedures can be generalized to more than one factor by specifying the number of


> op <- par(mfrow=c(3,1))
> plot(test,type="ICC")
> plot(test,type="IIC")
> plot(test,type="test")
> op <- par(mfrow=c(1,1))

[three panels plotted against the latent trait (normal scale): "Item parameters from factor analysis" (probability of response for V1-V9), "Item information from factor analysis", and "Test information -- item parameters from factor analysis"]

Figure 18: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.


> e.info <- plot(e.irt,type="IIC")

[item information curves for -E1, -E2, E3, E4, and E5 plotted against the latent trait (normal scale)]

Figure 19: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.


factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut:

> print(e.info,sort=TRUE)

Item Response Analysis using Factor Analysis 

Summary information by factor and item
 Factor =  1 
              -3    -2    -1     0    1    2    3
E2          0.48  0.89  1.18  1.24 0.99 0.59 0.25
E4          0.63  0.98  1.11  0.96 0.67 0.37 0.16
E1          0.37  0.58  0.75  0.76 0.61 0.39 0.21
E3          0.34  0.49  0.58  0.57 0.49 0.36 0.23
E5          0.33  0.42  0.47  0.45 0.37 0.27 0.17
Test Info   2.15  3.36  4.09  3.98 3.13 1.97 1.02
SEM         0.68  0.55  0.49  0.50 0.56 0.71 0.99
Reliability 0.53  0.70  0.76  0.75 0.68 0.49 0.02

More extensive IRT packages include ltm and eRm, and these should be used for serious Item Response Theory analysis.

7.2 Speeding up analyses

Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.

For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 20) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items (Figure 21).
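
As a sketch, the two-step sequence is simply (the irt.fa call itself appears with Figure 20 and the omega call with Figure 21):

> iq.irt <- irt.fa(iq.tf)      #the slow step: finds the tetrachoric correlations
> om <- omega(iq.irt$rho, 4)   #the fast step: reuses the saved correlation matrix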

> iq.irt

Item Response Analysis using Factor Analysis 

Call: irt.fa(x = iq.tf)
Item Response Analysis using Factor Analysis 

Summary information by factor and item
 Factor =  1 
             -3    -2    -1     0    1    2    3
reason.4   0.04  0.20  0.59  0.59 0.19 0.04 0.01
reason.16  0.07  0.23  0.46  0.39 0.16 0.05 0.01
reason.17  0.06  0.27  0.69  0.50 0.14 0.03 0.01
reason.19  0.05  0.17  0.41  0.46 0.22 0.07 0.02


> iq.irt <- irt.fa(iq.tf)

[item information curves for the 16 ability items plotted against the latent trait (normal scale)]

Figure 20: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.


letter.7     0.04  0.16  0.44  0.52 0.23  0.06  0.02
letter.33    0.04  0.14  0.34  0.43 0.24  0.08  0.02
letter.34    0.04  0.17  0.48  0.54 0.22  0.06  0.01
letter.58    0.02  0.08  0.28  0.55 0.39  0.13  0.03
matrix.45    0.05  0.11  0.23  0.29 0.21  0.10  0.04
matrix.46    0.05  0.12  0.25  0.30 0.21  0.10  0.04
matrix.47    0.05  0.16  0.37  0.41 0.21  0.07  0.02
matrix.55    0.04  0.08  0.14  0.20 0.19  0.13  0.06
rotate.3     0.00  0.01  0.07  0.29 0.66  0.45  0.13
rotate.4     0.00  0.01  0.05  0.30 0.91  0.55  0.11
rotate.6     0.01  0.03  0.14  0.49 0.68  0.27  0.06
rotate.8     0.00  0.02  0.08  0.27 0.54  0.39  0.13
Test Info    0.55  1.95  4.99  6.53 5.41  2.57  0.71
SEM          1.34  0.72  0.45  0.39 0.43  0.62  1.18
Reliability -0.80  0.49  0.80  0.85 0.82  0.61 -0.40

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104  and the objective function was  1.85 
The number of observations was  1525  with Chi Square =  2801.05  with prob <  0 

The root mean square of the residuals (RMSA) is  0.08 
The df corrected root mean square of the residuals is  0.09 

Tucker Lewis Index of factoring reliability =  0.742
RMSEA index =  0.131  and the 90 % confidence intervals are  0.126 0.135
BIC =  2038.76

7.3 IRT based scoring

The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for score.irt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items/person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The score.irt function applies basic IRT to this problem.

Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects:

> v9 <- sim.irt(9,1000,-2,2,mod="normal") #dichotomous items
> items <- v9$items
> test <- irt.fa(items)


> om <- omega(iq.irt$rho,4)

[omega diagram: the 16 ability items load on the general factor g and on four group factors F1-F4]

Figure 21: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when examining the item response functions.


> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> #now delete some of the data - note that they are ordered by score
> items[1:333,5:9] <- NA
> items[334:666,3:7] <- NA
> items[667:1000,1:4] <- NA
> scores <- score.irt(test,items)
> unit.weighted <- score.irt(items=items,keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord],scores,unit.weighted)
> colnames(scores.df) <- c("True theta","irt theta","total","fit","rasch","total","fit")

These results are seen in Figure 22.

8 Multilevel modeling

Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (overall descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.

8.1 Decomposing data into within and between level correlations using statsBy

There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.

This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:

rxy = ηxwg ∗ ηywg ∗ rxywg + ηxbg ∗ ηybg ∗ rxybg        (3)


> pairs.panels(scores.df,pch='.',gap=0)
> title('Comparing true theta for IRT, Rasch and classically based scoring',line=3)

[SPLOM of the True theta, irt theta, total, fit, rasch, total, and fit scores]

Figure 22: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.


where rxy is the normal correlation, which may be decomposed into within group and between group correlations, rxywg and rxybg, and η (eta) is the correlation of the data with the within group values or the group means.

8.2 Generating and displaying multilevel data

withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5 and V8, and V3, V6 and V9. V1 has a within group correlation of 0 with V2, V5 and V8, and a -1 within group correlation with V3, V6 and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5 and V6, and V7, V8 and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.

sim.multilevel will generate simulated data with a multilevel structure.

The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.

Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)) or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
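
A minimal sketch for the first case:

> sb <- statsBy(sat.act, group="education")
> sb$rwg   #the pooled within-group correlations
> sb$rbg   #the correlations of the group means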

9 Set Correlation and Multiple Regression from the correlation matrix

An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is

R² = 1 - ∏ᵢ₌₁ⁿ (1 - λi)


where λi is the ith eigen value of the eigen value decomposition of the matrix

R = Rxx⁻¹ Rxy Ryy⁻¹ Rxy′

Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic, based upon the average canonical correlation, might be more appropriate.

setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.

Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus if using the correlation matrix, will report standardized β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example, the analysis is done on the correlation matrix rather than the raw data.

> C <- cov(sat.act,use="pairwise")
> model1 <- lm(ACT~ gender + education + age, data=sat.act)
> summary(model1)

Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max 
-25.2458  -3.2133   0.7769   3.5921   9.2630 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110    
education    0.47890    0.15235   3.143  0.00174 ** 
age          0.01623    0.02278   0.712  0.47650    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared: 0.0272,	Adjusted R-squared: 0.02301 
F-statistic: 6.487 on 3 and 696 DF,  p-value: 0.0002476

Compare this with the output from setCor:

> #compare with mat.regress
> setCor(c(4:6),c(1:3),C, n.obs=700)


Call: setCor(y = y, x = x, data = data, z = z, n.obs = n.obs, use = use, 
    std = std, square = square, main = main, plot = plot)

Multiple Regression from matrix input 

Beta weights 
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R 
 ACT SATV SATQ 
0.16 0.10 0.19 

multiple R2 
   ACT   SATV   SATQ 
0.0272 0.0096 0.0359 

Unweighted multiple R 
 ACT SATV SATQ 
0.15 0.05 0.11 

Unweighted multiple R2 
 ACT SATV SATQ 
0.02 0.00 0.01 

SE of Beta weights 
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights 
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t < 
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2 
   ACT   SATV   SATQ 
0.0230 0.0054 0.0317 

Standard Error of R2 
   ACT   SATV   SATQ 
0.0120 0.0073 0.0137 

F 
 ACT SATV SATQ 
6.49 2.26 8.63 

Probability of F < 
     ACT     SATV     SATQ 
2.48e-04 8.08e-02 1.24e-05 

degrees of freedom of regression 
[1]   3 696

Various estimates of between set correlations
Squared Canonical Correlations 
[1] 0.050 0.033 0.008

Chisq of canonical correlations 
[1] 35.8 23.1  5.6

 Average squared canonical correlation =  0.03
 Cohens Set Correlation R2 =  0.09
Shrunken Set Correlation R2 =  0.08
F and df of Cohens Set Correlation  7.26 9 1681.86
Unweighted correlation between the two sets =  0.01

Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same independent of the direction of the relationship.

10 Simulation functions

It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.

A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex, and sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, and sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.

sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.simple Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.

sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.

Some of these functions are described in more detail in the companion vignette, psych for sem.

The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.

Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.

Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.
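
A sketch of such a simulation, assuming (as for the other sim functions) that the simulated raw data are returned in the observed element:

> set.seed(42)                              #for a reproducible example
> sm <- sim.minor(nvar=12, nfact=3, n=500)  #3 major and 6 minor factors
> fa.parallel(sm$observed)                  #are just the 3 major factors recovered?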

Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.


The four irt simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.

The logistic model is

P(x|θi, δj, γj, ζj) = γj + (ζj - γj)/(1 + e^(αj(δj - θi)))        (4)

where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), αj is item discrimination and δj is item difficulty. For the 1 Parameter Logistic (Rasch) model, gamma=0, zeta=1, alpha=1, and item difficulty is the only free parameter to specify.
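
Equation 4 written out as a small R function (a sketch, not a psych function; with the defaults it reduces to the Rasch model):

> p.irt <- function(theta, delta, alpha=1, gamma=0, zeta=1)
+    gamma + (zeta - gamma)/(1 + exp(alpha*(delta - theta)))
> p.irt(theta=0, delta=0)   #0.5 at the item's location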

(Graphics of these may be seen in the demonstrations for the logistic function)

The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same. With the α parameter = 1.702 in the logistic model, the two models are practically identical.

11 Graphical Displays

Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as specific diagram functions, e.g., (but not shown):

f3 <- fa(Thurstone,3)
plot(f3)
fa.diagram(f3)
c <- iclust(Thurstone)
plot(c)           #a pretty boring plot
iclust.diagram(c) #a better diagram
c3 <- iclust(Thurstone,3)
plot(c3)          #a more interesting plot
data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)
ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)

The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in the dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.


> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA,xlim=xlim,ylim=ylim,main="Demonstration of dia functions",axes=FALSE,xlab="",ylab="")
> ul <- dia.rect(1,9,labels="upper left",xlim=xlim,ylim=ylim)
> ll <- dia.rect(1,3,labels="lower left",xlim=xlim,ylim=ylim)
> lr <- dia.ellipse(9,3,"lower right",xlim=xlim,ylim=ylim)
> ur <- dia.ellipse(9,9,"upper right",xlim=xlim,ylim=ylim)
> ml <- dia.ellipse(3,6,"middle left",xlim=xlim,ylim=ylim)
> mr <- dia.ellipse(7,6,"middle right",xlim=xlim,ylim=ylim)
> bl <- dia.ellipse(1,1,"bottom left",xlim=xlim,ylim=ylim)
> br <- dia.rect(9,1,"bottom right",xlim=xlim,ylim=ylim)
> dia.arrow(from=lr,to=ul,labels="right to left")
> dia.arrow(from=ul,to=ur,labels="left to right")
> dia.curved.arrow(from=lr,to=ll$right,labels ="right to left")
> dia.curved.arrow(to=ur,from=ul$right,labels ="left to right")
> dia.curve(ll$top,ul$bottom,"double")  #for rectangles, specify where to point
> dia.curved.arrow(mr,ur,"up")          #but for ellipses, just point to it
> dia.curve(ml,mr,"across")
> dia.arrow(ur,lr,"top down")
> dia.curved.arrow(br$top,lr$bottom,"up")
> dia.curved.arrow(bl,br,"left to right")
> dia.arrow(bl,ll$bottom)
> dia.curved.arrow(ml,ll$right)
> dia.curved.arrow(mr,lr$top)

[the resulting plot shows the labelled rectangles and ellipses connected by the labelled arrows and curves]

Figure 23: The basic graphic capabilities of the dia functions are shown in this figure.

12 Miscellaneous functions

A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.

block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score.

geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headtail. Combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r and estimates effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also setCor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.


superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys, or when forming example problems.

13 Data sets

A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as McDonald (1999). The original data are from Thurstone and Thurstone (1941) and reanalyzed by Bechtoldt (1961). Personality item data representing five personality factors on 25 items (bfi) or 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.

Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 × 14 matrix from their paper. The Thurstone correlation matrix is a 9 × 9 matrix of correlations of ability items. The Reise data set is a 16 × 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 × 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, graphic displays.


iq 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton height. peas is the data set Francis Galton used to introduce the correlation coefficient, with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1994), this data set allows for examples of basic scaling techniques.

14 Development version and a user's guide

The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered ones). To download that version, go to the repository http://personality-project.org/r/src/contrib and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.

News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news:

> news(Version > "1.2.8",package="psych")


15 Psychometric Theory

The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book): An introduction to Psychometric Theory with Applications in R (Revelle, prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular, and R in general, consult http://personality-project.org/r/r.guide.html, A short guide to R.

16 SessionInfo

This document was prepared using the following settings:

> sessionInfo()

R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] psych_1.6.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4           lattice_0.20-33       matrixcalc_1.0-3      MASS_7.3-45           grid_3.3.0            arm_1.8-6
 [7] nlme_3.1-127          stats4_3.3.0          coda_0.18-1           minqa_1.2.4           mi_1.0                nloptr_1.0.4
[13] Matrix_1.2-6          boot_1.3-18           splines_3.3.0         lme4_1.1-12           sem_3.1-7             tools_3.3.0
[19] abind_1.4-3           GPArotation_2014.11-1 parallel_3.3.0        mnormt_1.5-4


References

Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405–432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439–458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447–473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel Modeling in R (2.3): A Brief Introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245–276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78–98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297–334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173–178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England.

Fox, J., Nie, Z., and Byrnes, J. (2013). sem: Structural Equation Models. R package version 3.1-3.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430–450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255–282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121–132.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41–54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283–300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4):537–549.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231–258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309–317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153–175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676–1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481–495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill, New York, 3rd edition.

Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57–74.

Revelle, W. (2017). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. http://cran.r-project.org/web/packages/psych. R package version 1.7.3.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39–73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403–414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Individual differences in cognition: New methods for examining the personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27–49. Springer, New York, NY.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega and the glb: comments on Sijtsma. Psychometrika, 74(1):145–154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83–90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345–353.

Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components–an alternative to "mathematical factors." Psychological Review, 42(5):425–454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321–327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123–133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121–144.


function are useful ways to look for strange effects involving outliers and non-linearities. error.bars.by will show group means with 95% confidence boundaries.
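
For example (a minimal sketch, assuming psych and its sat.act data set are loaded; see the error.bars.by help page for the many display options):

> error.bars.by(sat.act[5:6], group = sat.act$gender)   # SATV and SATQ means, with 95% confidence intervals, by gender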

4.4.1 Scatter Plot Matrices

Scatter Plot Matrices (SPLOMS) are very useful for describing the data. The pairs.panels function, adapted from the help menu for the pairs function, produces xy scatter plots of each pair of variables below the diagonal, shows the histogram of each variable on the diagonal, and shows the lowess locally fit regression line as well. An ellipse around the mean, with the axis length reflecting one standard deviation of the x and y variables, is also drawn. The x axis in each scatter plot represents the column variable, the y axis the row variable (Figure 1). When plotting many subjects, it is both faster and cleaner to set the plot character (pch) to be '.' (see Figure 1 for an example).

pairs.panels will show the pairwise scatter plots of all the variables as well as histograms, locally smoothed regressions, and the Pearson correlation. When plotting many data points (as in the case of the sat.act data), it is possible to specify that the plot character is a period to get a somewhat cleaner graphic.

4.4.2 Correlational structure

There are many ways to display correlations. Tabular displays are probably the most common. The output from the cor function in core R is a rectangular matrix. lowerMat will round this to (2) digits and then display it as a lower off diagonal matrix. lowerCor calls cor with use='pairwise', method='pearson' as default values, and returns (invisibly) the full correlation matrix while displaying the lower off diagonal matrix.

> lowerCor(sat.act)

          gendr edctn age   ACT   SATV  SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64  1.00

When comparing results from two different groups, it is convenient to display them as one matrix, with the results from one group below the diagonal and the other group above the diagonal. Use lowerUpper to do this:

> female <- subset(sat.act, sat.act$gender==2)
> male <- subset(sat.act, sat.act$gender==1)
> lower <- lowerCor(male[-1])


> png("pairspanels.png")
> pairs.panels(sat.act, pch=".")
> dev.off()
null device
          1

Figure 1: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. The plot character was set to a period (pch='.') in order to make a cleaner graph.


          edctn age   ACT  SATV SATQ
education  1.00
age        0.61  1.00
ACT        0.16  0.15 1.00
SATV       0.02 -0.06 0.61 1.00
SATQ       0.08  0.04 0.60 0.68 1.00

> upper <- lowerCor(female[-1])

          edctn age   ACT  SATV SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08 1.00
SATV       0.07 -0.03 0.53 1.00
SATQ       0.03 -0.09 0.58 0.63 1.00

> both <- lowerUpper(lower, upper)
> round(both, 2)

          education age   ACT   SATV  SATQ
education        NA  0.52  0.16  0.07  0.03
age            0.61    NA  0.08 -0.03 -0.09
ACT            0.16  0.15    NA  0.53  0.58
SATV           0.02 -0.06  0.61    NA  0.63
SATQ           0.08  0.04  0.60  0.68    NA

It is also possible to compare two matrices by taking their differences, displaying one below the diagonal and the difference of the second from the first above the diagonal:

> diffs <- lowerUpper(lower, upper, diff=TRUE)
> round(diffs, 2)

          education age   ACT   SATV  SATQ
education        NA  0.09  0.00 -0.05  0.05
age            0.61    NA  0.07 -0.03  0.13
ACT            0.16  0.15    NA  0.08  0.02
SATV           0.02 -0.06  0.61    NA  0.05
SATQ           0.08  0.04  0.60  0.68    NA

4.4.3 Heatmap displays of correlational structure

Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set, which has a clear 3 factor solution (Figure 2), or a simulated data set of 24 variables with a circumplex structure (Figure 3). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.


> png("corplot.png")
> cor.plot(Thurstone, numbers=TRUE, main="9 cognitive variables from Thurstone")
> dev.off()
null device
          1

Figure 2: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option, the values are displayed as well.


> png("circplot.png")
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> cor.plot(r.circ, main="24 variables in a circumplex")
> dev.off()
null device
          1

Figure 3: Using the cor.plot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect.


4.5 Polychoric, tetrachoric, polyserial, and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, φ will underestimate the value of the Pearson correlation applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process. A simple generalization of this to the case of multiple cuts is the polychoric correlation.

Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlation.

If the data are a mix of continuous, polytomous and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.

A correlation matrix formed from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will also happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigen values of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
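
A minimal sketch of that repair, using the burt data set from psych:

> data(burt)
> burt.smooth <- cor.smooth(burt)     # adjust the offending eigen values and return a smoothed matrix
> min(eigen(burt.smooth)$values)      # the smallest eigen value is now non-negative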

5 Item and scale analysis

The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.

5.1 Dimension reduction through factor analysis and cluster analysis

Parsimony of description has been a goal of science since at least the famous dictum, commonly attributed to William of Ockham, to not multiply entities beyond necessity¹. The goal for parsimony is seen in psychometrics as an attempt either to describe (components) or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

¹ Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.

The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs². At the most simple, a set of items can be thought to represent a random sample from one underlying domain, or perhaps a small set of domains. The question for the psychometrician is how many domains are represented, and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.

² Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.

At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components, or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores, or to estimate (for factors) factor scores.

Several of the functions in psych address the problem of data reduction:

fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis, and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will eventually be phased out.

principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values.

factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the sum of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.

vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ² estimate.

fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.

fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

nfactors A number of different tests for the number of factors problem are run.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.

5.1.1 Minimum Residual Factor Analysis

The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix $_nR_n$ be approximated by the product of a factor matrix $_nF_k$ and its transpose plus a diagonal matrix of uniquenesses?

$$R = FF' + U^2 \qquad (1)$$
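
Equation 1 can be examined directly: the residuals of a fitted solution are the observed correlations minus those implied by the model. A minimal sketch (residuals here is the generic R function, dispatching to psych's method for fa output):

> f3 <- fa(Thurstone, 3, n.obs = 213)
> round(residuals(f3), 2)   # observed R minus the model-implied correlations; near zero off the diagonal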

The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls" and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.

factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.

A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 1). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ" and "geominQ", and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation, although prior versions defaulted to varimax. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).

5.1.2 Principal Axis Factor Analysis

An alternative least squares algorithm, factor.pa (incorporated into fa as an option, fm="pa"), does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration; setting the max.iter parameter to 1 produces the SAS solution.
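
A sketch of that call (max.iter is the fa argument that controls the number of iterations; it defaults to a much larger value):

> f1.pa <- fa(Thurstone, 3, n.obs = 213, fm = "pa", max.iter = 1)   # one-step principal axis, as in the SAS example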

The solutions from the fa, the factor.minres and factor.pa as well as the principal functions can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations are varimax and quartimax. Oblique transformations include oblimin, quartimin, and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.

Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 2).

5.1.3 Weighted Least Squares Factor Analysis

Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller


Table 1: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.
> f3t <- fa(Thurstone, 3, n.obs=213)
> f3t
Factor Analysis using method = minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix

                  MR1   MR2   MR3   h2   u2 com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.0
First.Letters    0.00  0.86  0.00 0.73 0.27 1.0
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63


Table 2: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.
> f3 <- fa(Thurstone, 3, n.obs = 213, fm="pa")
> f3o <- target.rot(f3)
> f3o
Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix

                  PA1   PA2   PA3   h2   u2
Sentences        0.89 -0.03  0.07 0.81 0.19
Vocabulary       0.89  0.07  0.00 0.80 0.20
Sent.Completion  0.83  0.04  0.03 0.70 0.30
First.Letters   -0.02  0.85 -0.01 0.73 0.27
4.Letter.Words  -0.05  0.74  0.09 0.57 0.43
Suffixes         0.17  0.63 -0.09 0.43 0.57
Letter.Series   -0.06 -0.08  0.84 0.69 0.31
Pedigrees        0.33 -0.09  0.48 0.37 0.63
Letter.Group    -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00


overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 3).

The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 4). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 4) or by a structure diagram (Figure 5).

> plot(f3t)

[Figure 4: a SPLOM of the loadings on MR1, MR2, and MR3 (axes 0.0 to 0.8), with the overall title "Factor Analysis".]

Figure 4: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.

A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but


Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.
> f3w <- fa(Thurstone, 3, n.obs = 213, fm="wls")
> print(f3w, cut=0, digits=3)
Factor Analysis using method = wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix

                  WLS1   WLS2   WLS3    h2    u2  com
Sentences        0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary       0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion  0.833  0.034  0.007 0.735 0.265 1.00
First.Letters   -0.002  0.855  0.003 0.731 0.269 1.00
4.Letter.Words  -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes         0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series    0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees        0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group    -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627


> fa.diagram(f3t)

[Figure 5: a path diagram titled "Factor Analysis", showing the 9 Thurstone variables loading on MR1, MR2, and MR3 (loadings of roughly 0.6 to 0.9), with correlations of about 0.5 to 0.6 among the factors.]

Figure 5: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.


that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.

5.1.4 Principal Components analysis (PCA)

An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Psychologists typically use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 4 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. (Factor loadings and PCA loadings converge as the number of items increases; factor loadings may be thought of as the asymptotic principal component loadings as the number of items on the component grows towards infinity.) Also note how the goodness of fit statistics, based upon the residual off diagonal elements, are much worse than for the fa solution.

5.1.5 Hierarchical and bi-factor solutions

For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and to then have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).

Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 5) can themselves be factored. This results in a higher order factor model (Figure 6). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 7).


Table 4: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 2. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.
> p3p <- principal(Thurstone, 3, n.obs = 213, rotate="Promax")
> p3p
Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix

                  RC1   RC2   RC3   h2   u2 com
Sentences        0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary       0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion  0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters    0.01  0.84  0.07 0.78 0.22 1.0
4.Letter.Words  -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes         0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series    0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees        0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group    -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98


Yet another solution is to use structural equation modeling to directly solve for the general and group factors.

> om.h <- omega(Thurstone, n.obs=213, sl=FALSE)
> op <- par(mfrow=c(1,1))

[Figure 6: a path diagram titled "Omega", showing the 9 Thurstone variables loading on group factors F1, F2, and F3, which in turn load (about 0.7 to 0.8) on a higher order general factor g.]

Figure 6: A higher order factor solution to the Thurstone 9 variable problem.

Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011). Unfortunately, this rotation will frequently achieve a local rather than global optimum, and it is recommended to try multiple original configurations.
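
One way to try this from fa (a sketch; it assumes the installed psych and GPArotation accept "bifactor" as a rotate option, and because of the local optimum problem the result should be compared against the omega solution of Figure 7):

> f3.bi <- fa(Thurstone, 3, n.obs = 213, rotate = "bifactor")   # Jennrich-Bentler orthogonal bifactor rotation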


> om <- omega(Thurstone, n.obs=213)

[Figure 7: a path diagram titled "Omega", showing each of the 9 Thurstone variables loading directly on the general factor g (about 0.5 to 0.7) and on residualized group factors F1, F2, and F3 (about 0.2 to 0.6).]

Figure 7: A bifactor solution to the Thurstone 9 variable problem.


5.1.6 Item Cluster Analysis: iclust

An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) review why the iclust algorithm is particularly appropriate for scale construction in marketing.

Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust, and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979), which proceeds as follows (a schematic R version of steps 1-4 is sketched after the list):

1. Find the proximity (e.g., correlation) matrix.

2. Identify the most similar pair of items.

3. Combine this most similar pair of items to form a new variable (cluster).

4. Find the similarity of this cluster to all other items and clusters.

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains, or in iclust if there is a failure to increase reliability coefficients α or β).

6. Purify the solution by reassigning items to the most similar cluster center.
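
The merging logic of steps 1-4 can be sketched in a few lines of R. The cluster.sketch function below is a deliberately simplified, hypothetical illustration, not iclust itself: iclust also reverses negatively keyed items, applies the α/β stopping rule of step 5, and purifies the solution (step 6).

cluster.sketch <- function(x, n.clusters = 2) {
  x <- as.matrix(x)
  while (ncol(x) > n.clusters) {                # a crude stand-in for the reliability criterion
    R <- cor(x, use = "pairwise")               # step 1: the proximity matrix
    diag(R) <- NA                               # ignore self correlations
    ij <- which(R == max(R, na.rm = TRUE), arr.ind = TRUE)[1, ]   # step 2: most similar pair
    cluster <- rowSums(scale(x[, ij]), na.rm = TRUE)              # step 3: pool the pair
    x <- cbind(x[, -ij, drop = FALSE], cluster) # step 4: similarity is recomputed on the next pass
  }
  x                                             # columns are the surviving clusters
}

> summary(cluster.sketch(bfi[1:10], 4))   # e.g., pool 10 bfi items into 4 composites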

iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori, or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 8).


Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 7). The pattern matrix is available as an object in the results.

> data(bfi)
> ic <- iclust(bfi[1:25])

[Figure 8: an ICLUST tree diagram of the 25 bfi items, showing clusters C20 (α = .81, β = .63), C16 (α = .81, β = .76), C15 (α = .73, β = .67), and C21 (α = .41, β = .27), with item loadings of roughly .3 to .9.]

Figure 8: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.

The previous analysis (Figure 8) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 9). Note that finding the polychoric correlations takes some time the first time, but the next three analyses were done using that correlation matrix (r.poly$rho). When using the console for input, polychoric will report on its progress while working, using progressBar.


Table 5: The summary statistics from an iclust analysis shows three large clusters and a smaller cluster.
> summary(ic)  # show the results
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*:
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations, reliabilities on diagonal, correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61


polychoric and tetrachoric, when running on OS X, will take advantage of the multiple cores on the machine and run somewhat faster.

Table 6: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.
> data(bfi)
> r.poly <- polychoric(bfi[1:25])  # the ... indicate the progress of the function

A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.

5.2 Confidence intervals using bootstrapping techniques

Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ² and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 8). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.
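
The polychoric variant is a one-argument change (a sketch; cor = "poly" asks fa to first find the polychoric correlations, which is slow, before bootstrapping):

> fb <- fa(bfi[1:10], 2, n.iter = 20, cor = "poly")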

5.3 Comparing factor/component/cluster solutions

Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of the congruence coefficient, which is just the cosine of the angle between the dimensions:

$$c_{f_i f_j} = \frac{\sum_{k=1}^{n} f_{ik} f_{jk}}{\sqrt{\sum f_{ik}^2 \sum f_{jk}^2}}.$$
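
Written out in R, the coefficient is just a normalized cross product. An illustrative one-liner (factor.congruence does the same thing for entire loading matrices):

> congruence <- function(f1, f2) sum(f1 * f2) / sqrt(sum(f1^2) * sum(f2^2))
> congruence(f3t$loadings[, 1], f3w$loadings[, 1])   # e.g., minres MR1 vs. wls WLS1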

Consider the case of a four factor solution and a four cluster solution to the Big Five problem:

> f4 <- fa(bfi[1:25], 4, fm="pa")
> factor.congruence(f4, ic)

      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12


> ic.poly <- iclust(r.poly$rho, title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)

[Figure 9: an ICLUST tree diagram titled "ICLUST using polychoric correlations", with a top-level cluster C23 (α = .83, β = .58) subsuming clusters C20, C19, C16, C15, and C21.]

Figure 9: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 8), which was done using Pearson correlations.


> ic.poly <- iclust(r.poly$rho, 5, title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)

[Figure 10: an ICLUST tree diagram titled "ICLUST using polychoric correlations for nclusters=5", showing five clusters: C20, C19, C16, C15, and a residual openness cluster.]

Figure 10: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 9), which was done without specifying the number of clusters, and to the next one (Figure 11), which was done by changing the beta criterion.


> ic.poly <- iclust(r.poly$rho, beta.size=3, title="ICLUST beta.size=3")

[Figure 11: an ICLUST tree diagram titled "ICLUST beta.size=3", with a top-level cluster C23 (α = .83, β = .58) formed from clusters C22, C21, C20, C19, C16, and C15.]

Figure 11: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 8, 9, 10).


Table 7: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.
> print(ic, cut=.3)
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
    O   P     C20   C16   C15   C21
A1 C20 C20
A2 C20 C20   0.59
A3 C20 C20   0.65
A4 C20 C20   0.43
A5 C20 C20   0.65
C1 C15 C15               0.54
C2 C15 C15               0.62
C3 C15 C15               0.54
C4 C15 C15         0.31 -0.66
C5 C15 C15  -0.30  0.36 -0.59
E1 C20 C20  -0.50
E2 C20 C20  -0.61  0.34
E3 C20 C20   0.59             -0.39
E4 C20 C20   0.66
E5 C20 C20   0.50        0.40 -0.32
N1 C16 C16         0.76
N2 C16 C16         0.75
N3 C16 C16         0.74
N4 C16 C16  -0.34  0.62
N5 C16 C16         0.55
O1 C20 C21                    -0.53
O2 C21 C21                     0.44
O3 C20 C21   0.39             -0.62
O4 C21 C21                    -0.33
O5 C21 C21                     0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations, reliabilities on diagonal, correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit = 0.68   Pattern fit = 0.96   RMSR = 0.05
NULL


Table 8: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.
> fa(bfi[1:10], 2, n.iter=20)
Factor Analysis with confidence intervals using method = fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method = minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix

     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with MLE Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.078 and the 90% confidence intervals are 0.071 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.46 -0.40 -0.36
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.06 -0.03  0.00  0.74  0.77  0.80
A4  0.10  0.15  0.18  0.41  0.44  0.48
A5 -0.02  0.02  0.07  0.58  0.62  0.66
C1  0.52  0.57  0.60 -0.08 -0.05 -0.02
C2  0.59  0.62  0.66 -0.04 -0.01  0.02
C3  0.50  0.54  0.57 -0.01  0.03  0.08
C4 -0.70 -0.66 -0.62 -0.02  0.01  0.03
C5 -0.62 -0.57 -0.53 -0.08 -0.05 -0.01

Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.28     0.32  0.37


PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90

A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor, and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 9).

Table 9: Congruence coefficients for oblique factor, bifactor, and component solutions for the Thurstone problem.
> factor.congruence(list(f3t, f3o, om, p3p))

     MR1  MR2  MR3  PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2  RC3
MR1 1.00 0.06 0.09 1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
MR2 0.06 1.00 0.08 0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
MR3 0.09 0.08 1.00 0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
PA1 1.00 0.03 0.01 1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2 0.06 1.00 0.01 0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99 0.05
PA3 0.13 0.06 1.00 0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01 0.99
g   0.72 0.60 0.52 0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58 0.50
F1  1.00 0.06 0.09 1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
F2  0.06 1.00 0.08 0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
F3  0.09 0.08 1.00 0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
h2  0.74 0.57 0.51 0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56 0.49
RC1 1.00 0.04 0.06 1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06 0.00
RC2 0.08 0.99 0.02 0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00 0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05 1.00

5.4 Determining the number of dimensions to extract

How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).

Techniques most commonly used include:

1) Extracting factors until the chi square of the residual matrix is not significant.

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis) fa.parallel (Horn, 1965).

4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).

5) Extracting factors as long as they are interpretable.

6) Using the Very Simple Structure criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.

Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects, and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples (N > 1000) all of the eigen values of random factors will be very close to 1. The scree test is quite appealing, but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigator's creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.
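
Most of these criteria can be examined in a single call (a sketch; nfactors runs the vss, MAP, and several fit-based tests together):

> nf <- nfactors(bfi[1:25], n = 8)   # evaluate solutions with 1 through 8 factors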

An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest that nevertheless account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors, and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.

5.4.1 Very Simple Structure

The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Absolute Partial correlation) of Velicer (1976).

Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 12). However, the MAP criterion suggests that 5 is optimal.


> vss <- vss(bfi[1:25], title="Very Simple Structure of a Big 5 inventory")

[Figure 12: a plot titled "Very Simple Structure of a Big 5 inventory", with Very Simple Structure Fit (0 to 1) on the y axis against the number of factors (1 to 8) on the x axis, with separate curves for complexity 1 through 4 solutions.]

Figure 12: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.


> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximum of 0.58 with 4 factors
VSS complexity 2 achieves a maximum of 0.74 with 4 factors

The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit RMSEA  BIC   SABIC complex eChisq  SRMR eCRMS  eBIC
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.123 9648 10522.1     1.0  23881 0.119 0.125 21698
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.100 5287  6084.5     1.2  12432 0.086 0.094 10440
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.087 3200  3924.3     1.3   7232 0.066 0.075  5422
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.074 1731  2385.1     1.4   3750 0.047 0.057  2115
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.055  281   869.3     1.6   1495 0.030 0.038    27
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.043 -296   228.5     1.7    670 0.020 0.027  -639
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.037 -463     1.2     1.9    448 0.016 0.023  -711
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.032 -524  -117.6     1.9    289 0.013 0.020  -727

5.4.2 Parallel Analysis

An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 13). It is interesting to compare fa.parallel with paran from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial minor (but not zero) factors, show that using smcs will tend to lead to too many factors.

A more tedious problem, in terms of computation, is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly, or by fa.parallel with the cor option="poly". By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
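
A sketch of such a call (slow, because the polychoric matrix must be estimated first; n.iter controls the number of simulated data sets):

> fa.parallel(bfi[1:25], cor = "poly", n.iter = 20)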

5.5 Factor extension

Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created to represent one two dimensional space.


> fa.parallel(bfi[1:25], main="Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors = 6 and the number of components = 6

[Figure 13: a scree-type plot titled "Parallel Analysis of a Big 5 inventory", with eigen values of principal components and factor analysis on the y axis against factor/component number on the x axis, with curves for PC and FA eigen values from the actual, simulated, and resampled data.]

Figure 13: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.


If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 14).

Another way to examine the overlap between two sets is the use of set correlation, found by set.cor (discussed later).

6 Classical Test Theory and Reliability

Surprisingly, 112 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).

α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

\[
\lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(V_x)}{V_x}\right) = \frac{n}{n-1}\,\frac{V_x - tr(V_x)}{V_x} = \alpha
\]
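Because the formula needs only the item covariance matrix, α is easy to verify by hand. A minimal sketch, assuming r9 is the simulated 9 item data set introduced in section 6.1 (tr is psych's trace function):

> Vx <- cov(r9)                      #the item covariance matrix
> n <- ncol(Vx)                      #the number of items
> n/(n-1) * (1 - tr(Vx)/sum(Vx))     #raw alpha; should match the 0.8 reported by alpha(r9)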

Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. For τ equivalent items, alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy") coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.

Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, $e_j^2$, and is

\[
\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum (1 - r_{smc}^2)}{V_x}.
\]
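This too is a one-liner once the squared multiple correlations are in hand. A minimal sketch, using psych's smc function and again assuming the standardized items in r9:

> R <- cor(r9)
> 1 - sum(1 - smc(R))/sum(R)    #lambda 6; should match the G6(smc) reported by alpha(r9)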

The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.

G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability


> v16 <- sim.item(16)
> s <- c(1,3,5,7,9,11,13,15)
> f2 <- fa(v16[,s],2)
> fe <- fa.extension(cor(v16)[s,-s],f2)
> fa.diagram(f2,fe=fe)

[Figure: path diagram titled "Factor analysis and extension": two factors (MR1, MR2) defined by the odd numbered variables V1 ... V15 on the left are extended to the even numbered variables V2 ... V16 on the right, with loadings of about |0.5| to |0.7|.]

Figure 14: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.


the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.

Alpha, G6, and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.

More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.

Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.

Functions for examining the reliability of a single scale or a set of scales include:

alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item - whole correlations, α if an item is omitted, and item means and standard deviations.

guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zergers (1978) (μ0 ... μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lower bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).

omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)

cluster.cor Given a n x c cluster definition matrix of -1s, 0s, and 1s (the keys), and a n x n correlation matrix, find the correlations of the composite clusters.

scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of items scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha coefficient, G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.

score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.

6.1 Reliability of a single scale

A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, and McDonald's ωh and ωt. Consider a simulated data set, representing 9 items with a hierarchical structure, and the following correlation matrix. Then using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as for the cases where one item at a time is dropped.

> set.seed(17)
> r9 <- sim.hierarchical(n=500,raw=TRUE)$observed
> round(cor(r9),2)

     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00

> alpha(r9)

Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
0.77 0.8 0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97

Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that item 2 is to be scored (incorrectly) negatively.

> keys <- c(1,-1,1,1,1,1,1)
> alpha(attitude,keys)

Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
       0.43      0.52    0.75      0.14 1.1 0.13   58 5.5

 lower alpha upper     95% confidence boundaries
0.18 0.43 0.68

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r  S/N alpha se
rating           0.32      0.44    0.67     0.114 0.77    0.170
complaints-      0.80      0.80    0.82     0.394 3.89    0.057
privileges       0.27      0.41    0.72     0.103 0.69    0.176
learning         0.14      0.31    0.64     0.069 0.44    0.207
raises           0.14      0.27    0.61     0.059 0.38    0.205
critical         0.36      0.47    0.76     0.130 0.90    0.139
advance          0.21      0.34    0.66     0.079 0.51    0.171

 Item statistics
             n raw.r std.r r.cor r.drop mean   sd
rating      30  0.60  0.60  0.60   0.33   65 12.2
complaints- 30 -0.57 -0.58 -0.74  -0.74   50 13.3
privileges  30  0.66  0.65  0.54   0.42   53 12.2
learning    30  0.80  0.79  0.78   0.64   56 11.7
raises      30  0.82  0.83  0.85   0.69   65 10.4
critical    30  0.50  0.53  0.35   0.27   75  9.9
advance     30  0.74  0.75  0.71   0.58   43 10.3

Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if item 2 is dropped then the reliability is improved substantially. This suggests that item 2 was incorrectly scored. Doing the analysis again with item 2 positively scored produces much more favorable results.

> keys <- c(1,1,1,1,1,1,1)
> alpha(attitude,keys)


Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
0.76 0.84 0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3

It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.

> items <- sim.congeneric(N=500,short=FALSE,low=-2,high=2,categorical=TRUE) #500 responses to 4 discrete items
> alpha(items$observed)  #item response analysis of congeneric measures

Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
0.68 0.72 0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02


Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0

6.2 Using omega to find the reliability of a single scale

Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (See Figure 15) or omegaSem for a confirmatory analysis using the sem package (Fox et al., 2013) based upon the exploratory solution from omega.

McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009) http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf ).

One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.
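A minimal sketch of those steps done by hand rather than by omega (assuming the r9 data set from above; fa and schmid are the psych functions named in the text):

> f3 <- fa(r9,nfactors=3,rotate="oblimin")   #oblique factor solution
> sl <- schmid(cor(r9),nfactors=3)           #Schmid-Leiman transformation
> round(sl$sl,2)                             #general factor + group factor loadings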

ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal). For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items, it is frequently the case that some items are to be scored positively, and some negatively. Although probably better to specify which directions the items are to be scored by specifying a key vector, if flip =TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)

β, an alternative to ω, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).

The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).

Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.

There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be the $\sqrt{r_{ab}}$ where $r_{ab}$ is the oblique correlation between the factors), or to "first" or "second" in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. 2007. To do this in omega, add the option="first" or option="second" to the call.
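For example, a hedged two factor call on the simulated r9 data from above:

> om2 <- omega(r9,nfactors=2,option="first")   #equate g with the first group factor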

Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.

In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.

McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness $u_j^2$ from factor analysis to find $e_j^2$. This is based on a decomposition of the variance of a test score, $V_x$, into four parts: that due to a general factor, g, that due to a set of group factors, f, (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error).

Letting x = cg + Af + Ds + e, then the communality of item j, based upon general as well as group factors, $h_j^2 = c_j^2 + \sum f_{ij}^2$, and the unique variance for the item, $u_j^2 = \sigma_j^2 (1 - h_j^2)$, may be used to estimate the test reliability. That is, if $h_j^2$ is the communality of item j, based upon general as well as group factors, then for standardized items, $e_j^2 = 1 - h_j^2$ and

\[
\omega_t = \frac{\mathbf{1cc'1} + \mathbf{1AA'1'}}{V_x} = 1 - \frac{\sum (1 - h_j^2)}{V_x} = 1 - \frac{\sum u^2}{V_x}
\]


Because $h_j^2 \geq r_{smc}^2$, $\omega_t \geq \lambda_6$.

It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor.

\[
\omega_h = \frac{\mathbf{1cc'1}}{V_x}
\]

Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

\[
\omega_{\infty} = \frac{\mathbf{1cc'1}}{\mathbf{1cc'1} + \mathbf{1AA'1'}}
\]

In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.

> om9

9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than  0.2
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03


> om9 <- omega(r9,title="9 simulated variables")

[Figure: omega bifactor diagram titled "9 simulated variables": V1 ... V9 load on three group factors F1*, F2*, F3* (loadings about 0.2 to 0.5) and on a general factor g (loadings about 0.3 to 0.7).]

Figure 15: A bifactor solution for 9 simulated variables with a hierarchical structure.


RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

 Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

6.3 Estimating ωh using Confirmatory Factor Analysis

The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.

> omegaSem(r9,n.obs=500)

Call: omegaSem(m = r9, n.obs = 500)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than  0.2
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63


With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03
RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

 Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

 Omega Hierarchical from a confirmatory model using sem =  0.72
 Omega Total from a confirmatory model using sem =  0.83
With loadings of
      g  F1*  F2*  F3*   h2   u2
V1 0.74 0.40           0.71 0.29
V2 0.58 0.36           0.47 0.53
V3 0.56 0.42           0.50 0.50
V4 0.57      0.45      0.53 0.47
V5 0.58      0.25      0.40 0.60
V6 0.43      0.26      0.26 0.74
V7 0.49           0.38 0.38 0.62
V8 0.44           0.45 0.39 0.61
V9 0.34           0.24 0.17 0.83

With eigenvalues of:
   g  F1*  F2*  F3*
2.60 0.47 0.40 0.33


6.3.1 Other estimates of reliability

Other estimates of reliability are found by the splitHalf function. These are described in more detail in Revelle and Zinbarg (2009). They include the six estimates discussed by Guttman (1945), four from Ten Berge, and an estimate of the greatest lower bound.

> splitHalf(r9)

Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) =  0.84
Guttman lambda 6                          =  0.8
Average split half reliability            =  0.79
Guttman lambda 3 (alpha)                  =  0.8
Minimum split half reliability (beta)     =  0.72

6.4 Reliability and correlations of multiple scales within an inventory

A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness, and Openness). The data may either be the raw data or a correlation matrix (scoreItems) or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6*.

6.4.1 Scoring from raw data

To score these five scales from the 25 items, use the scoreItems function with the helper function make.keys. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1. 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.
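To make the reverse scoring rule concrete, a tiny illustration assuming items scored on a 1 to 6 scale (as in the bfi):

> x <- c(1,2,6)    #three raw responses
> (6 + 1) - x      #reverse scored: 6 5 1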

There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.

The printed output includes coefficients α and G6*, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.

First, create a keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copying it into R.) Although not normally necessary, show the keys to understand what is happening.

Note that the number of items to specify in the make.keys function is the total number of items in the inventory. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.

> keys <- make.keys(nvars=28,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+    Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+    Openness = c(21,-22,23,24,-25)),
+    item.labels=colnames(bfi))
> keys

          Agree Conscientious Extraversion Neuroticism Openness
A1           -1             0            0           0        0
A2            1             0            0           0        0
A3            1             0            0           0        0
A4            1             0            0           0        0
A5            1             0            0           0        0
C1            0             1            0           0        0
C2            0             1            0           0        0
C3            0             1            0           0        0
C4            0            -1            0           0        0
C5            0            -1            0           0        0
E1            0             0           -1           0        0
E2            0             0           -1           0        0
E3            0             0            1           0        0
E4            0             0            1           0        0
E5            0             0            1           0        0
N1            0             0            0           1        0
N2            0             0            0           1        0
N3            0             0            0           1        0
N4            0             0            0           1        0
N5            0             0            0           1        0
O1            0             0            0           0        1
O2            0             0            0           0       -1
O3            0             0            0           0        1
O4            0             0            0           0        1
O5            0             0            0           0       -1
gender        0             0            0           0        0
education     0             0            0           0        0
age           0             0            0           0        0

The use of multiple key matrices for different inventories is facilitated by using the superMatrix function to combine two or more matrices. This allows convenient scoring of large data sets combining multiple inventories with keys based upon each individual inventory. Pretend for the moment that the big 5 items were made up of two inventories, one consisting of the first 10 items, the second the last 18 items (15 personality items + 3 demographic items). Then the following code would work:

> keys.1 <- make.keys(10,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10)))
> keys.2 <- make.keys(15,list(Extraversion=c(-1,-2,3:5),Neuroticism=c(6:10),
+    Openness = c(11,-12,13,14,-15)))
> keys.25 <- superMatrix(list(keys.1,keys.2))

The resulting keys matrix is identical to that found above, except that it does not include the extra 3 demographic items. This is useful when scoring the raw items, because the response frequencies for each category are reported, as are those for the demographic data.

This use of making multiple key matrices and then combining them into one super matrix of keys is particularly useful when combining demographic information with items to be scored. A set of demographic keys can be made and then these can be combined with the keys for the particular scales.

Now use these keys in combination with the raw data to score the items, calculate basic reliability and intercorrelations, and find the item-by-scale correlations for each item and each scale. By default, missing data are replaced by the median for that variable.

> scores <- scoreItems(keys,bfi)
> scores

Call: scoreItems(keys = keys, items = bfi)

(Unstandardized) Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
alpha   0.7          0.72         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
    Agree Conscientious Extraversion Neuroticism Openness
ASE 0.014         0.014        0.013       0.011    0.017

Average item correlation:
          Agree Conscientious Extraversion Neuroticism Openness
average.r  0.32          0.34         0.39        0.46     0.23

 Guttman 6* reliability:
         Agree Conscientious Extraversion Neuroticism Openness
Lambda.6   0.7          0.72         0.76        0.81      0.6

Signal/Noise based upon av.r :
             Agree Conscientious Extraversion Neuroticism Openness
Signal/Noise   2.3           2.6          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation
raw correlations below the diagonal, alpha on the diagonal
corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.70          0.36         0.63      -0.245     0.23
Conscientious  0.26          0.72         0.35      -0.305     0.30
Extraversion   0.46          0.26         0.76      -0.284     0.32
Neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
Openness       0.15          0.19         0.22      -0.086     0.60

In order to see the item by scale loadings and frequency counts of the data, print with the short option = FALSE.

To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores. (See Figure 16.)

6.4.2 Forming scales from a correlation matrix

There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.

Consider the same bfi data set, but first find the correlations, and then use cluster.cor.

> r.bfi <- cor(bfi,use="pairwise")
> scales <- cluster.cor(keys,r.bfi)
> summary(scales)

Call: cluster.cor(keys = keys, r.mat = r.bfi)

Scale intercorrelations corrected for attenuation
raw correlations below the diagonal, (standardized) alpha on the diagonal
corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.71          0.35         0.64      -0.242     0.25
Conscientious  0.25          0.73         0.36      -0.287     0.30
Extraversion   0.47          0.27         0.76      -0.278     0.35
Neuroticism   -0.18         -0.22        -0.22       0.815    -0.11
Openness       0.16          0.20         0.24      -0.074     0.61

To find the correlations of the items with each of the scales (the "structure" matrix), or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.


> png('scores.png')
> pairs.panels(scores$scores,pch='.',jiggle=TRUE)
> dev.off()
quartz
     2

Figure 16: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.


6.5 Scoring Multiple Choice Items

Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).

> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys,iqitems)

Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

 item statistics
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39

> #just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys,iqitems,score=FALSE)
> describe(iq.tf)  #compare to previous results

          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01

Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
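For instance, a hedged sketch that groups the 16 true/false items into four hypothetical subscales based upon their item names, using make.keys and scoreItems as above:

> ability.keys <- make.keys(16,list(reason=1:4,letter=5:8,matrix=9:12,rotate=13:16),
+    item.labels=colnames(iq.tf))
> ability.scores <- scoreItems(ability.keys,iq.tf)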

6.6 Item analysis

Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or based upon correlation matrices using the setCor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.

Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 17).

7 Item Response Theory analysis

The use of Item Response Theory is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 18).


> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys,iqitems,score=TRUE,short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  #set this to see the output for multiple items
> irt.responses(scores$scores,iqitems[1:4],breaks=11)

[Figure: four panels (reason.4, reason.16, reason.17, reason.19) plotting P(response) against theta, with one curve for each response alternative 0 ... 6.]

Figure 17: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.


7.1 Factor analysis and Item Response Theory

If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δj, and item discrimination, αj, may be found from factor analysis of the tetrachoric correlations where λj is just the factor loading on the first factor and τj is the normal threshold reported by the tetrachoric function.

\[
\delta_j = \frac{D \tau_j}{\sqrt{1 - \lambda_j^2}}, \qquad \alpha_j = \frac{\lambda_j}{\sqrt{1 - \lambda_j^2}} \tag{2}
\]

where D is a scaling factor used when converting to the parameterization of the logistic model and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λj) and item thresholds (τj) are just

\[
\lambda_j = \frac{\alpha_j}{\sqrt{1 + \alpha_j^2}}, \qquad \tau_j = \frac{\delta_j}{\sqrt{1 + \alpha_j^2}}.
\]
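Equation (2) is easy to transcribe directly. A minimal sketch, where lambda is assumed to be a vector of first factor loadings and tau the corresponding tetrachoric thresholds (fa2irt is a made up name for illustration):

> fa2irt <- function(lambda,tau,D=1) {   #D = 1.702 for the logistic parameterization
+    alpha <- lambda/sqrt(1-lambda^2)    #item discrimination
+    delta <- D*tau/sqrt(1-lambda^2)     #item difficulty (location)
+    data.frame(discrimination=alpha,difficulty=delta) }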

Consider 9 dichotomous items representing one factor but differing in their levels of difficulty.

> set.seed(17)
> d9 <- sim.irt(9,1000,-2.5,2.5,mod="normal")  #dichotomous items
> test <- irt.fa(d9$items)

> test

Item Response Analysis using Factor Analysis

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor =  1
               -3    -2    -1     0     1     2     3
V1           0.48  0.59  0.24  0.06  0.01  0.00  0.00
V2           0.30  0.68  0.45  0.12  0.02  0.00  0.00
V3           0.12  0.50  0.74  0.29  0.06  0.01  0.00
V4           0.05  0.26  0.74  0.55  0.14  0.03  0.00
V5           0.01  0.07  0.44  1.05  0.41  0.06  0.01
V6           0.00  0.03  0.15  0.60  0.74  0.24  0.04
V7           0.00  0.01  0.04  0.22  0.73  0.63  0.16
V8           0.00  0.00  0.02  0.12  0.45  0.69  0.31
V9           0.00  0.01  0.02  0.08  0.25  0.47  0.36
Test Info    0.98  2.14  2.85  3.09  2.81  2.14  0.89
SEM          1.01  0.68  0.59  0.57  0.60  0.68  1.06
Reliability -0.02  0.53  0.65  0.68  0.64  0.53 -0.12


Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27  and the objective function was  1.2
The number of observations was  1000  with Chi Square =  1195.58  with prob <  7.3e-235

The root mean square of the residuals (RMSA) is  0.09
The df corrected root mean square of the residuals is  0.1

Tucker Lewis Index of factoring reliability =  0.699
RMSEA index =  0.209  and the 90 % confidence intervals are  0.198 0.218
BIC =  1009.07

Similar analyses can be done for polytomous items such as those of the bfi extraversion scale:

> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor =  1
               -3    -2    -1     0     1     2     3
E1           0.37  0.58  0.75  0.76  0.61  0.39  0.21
E2           0.48  0.89  1.18  1.24  0.99  0.59  0.25
E3           0.34  0.49  0.58  0.57  0.49  0.36  0.23
E4           0.63  0.98  1.11  0.96  0.67  0.37  0.16
E5           0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info    2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM          0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability  0.53  0.70  0.76  0.75  0.68  0.49  0.02

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5  and the objective function was  0.05
The number of observations was  2800  with Chi Square =  135.92  with prob <  1.3e-27

The root mean square of the residuals (RMSA) is  0.04
The df corrected root mean square of the residuals is  0.06

Tucker Lewis Index of factoring reliability =  0.932
RMSEA index =  0.097  and the 90 % confidence intervals are  0.083 0.111
BIC =  96.23

The item information functions show that not all of the items are equally good (Figure 19):

These procedures can be generalized to more than one factor by specifying the number of


> op <- par(mfrow=c(3,1))
> plot(test,type="ICC")
> plot(test,type="IIC")
> plot(test,type="test")
> op <- par(mfrow=c(1,1))

[Figure: three panels plotted against the Latent Trait (normal scale): "Item parameters from factor analysis" (Probability of Response for V1 ... V9), "Item information from factor analysis" (Item Information for V1 ... V9), and "Test information -- item parameters from factor analysis" (Test Information, with Reliability on the right hand axis).]

Figure 18: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.


> e.info <- plot(e.irt,type="IIC")

[Figure: "Item information from factor analysis for factor 1", plotting Item Information against the Latent Trait (normal scale) for items -E1, -E2, E3, E4, and E5.]

Figure 19: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.


factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.

> print(e.info,sort=TRUE)

Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor =  1
               -3    -2    -1     0     1     2     3
E2           0.48  0.89  1.18  1.24  0.99  0.59  0.25
E4           0.63  0.98  1.11  0.96  0.67  0.37  0.16
E1           0.37  0.58  0.75  0.76  0.61  0.39  0.21
E3           0.34  0.49  0.58  0.57  0.49  0.36  0.23
E5           0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info    2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM          0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability  0.53  0.70  0.76  0.75  0.68  0.49  0.02

More extensive IRT packages include ltm and eRm and should be used for serious Item Response Theory analysis.

7.2 Speeding up analyses

Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.

For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 20) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items.

> iq.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = iq.tf)
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor =  1
               -3    -2    -1     0     1     2     3
reason.4     0.04  0.20  0.59  0.59  0.19  0.04  0.01
reason.16    0.07  0.23  0.46  0.39  0.16  0.05  0.01
reason.17    0.06  0.27  0.69  0.50  0.14  0.03  0.01
reason.19    0.05  0.17  0.41  0.46  0.22  0.07  0.02


> iq.irt <- irt.fa(iq.tf)

[Figure: "Item information from factor analysis", plotting Item Information against the Latent Trait (normal scale) for the 16 ability items (reason.4 ... rotate.8).]

Figure 20: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.


letter.7     0.04  0.16  0.44  0.52  0.23  0.06  0.02
letter.33    0.04  0.14  0.34  0.43  0.24  0.08  0.02
letter.34    0.04  0.17  0.48  0.54  0.22  0.06  0.01
letter.58    0.02  0.08  0.28  0.55  0.39  0.13  0.03
matrix.45    0.05  0.11  0.23  0.29  0.21  0.10  0.04
matrix.46    0.05  0.12  0.25  0.30  0.21  0.10  0.04
matrix.47    0.05  0.16  0.37  0.41  0.21  0.07  0.02
matrix.55    0.04  0.08  0.14  0.20  0.19  0.13  0.06
rotate.3     0.00  0.01  0.07  0.29  0.66  0.45  0.13
rotate.4     0.00  0.01  0.05  0.30  0.91  0.55  0.11
rotate.6     0.01  0.03  0.14  0.49  0.68  0.27  0.06
rotate.8     0.00  0.02  0.08  0.27  0.54  0.39  0.13
Test Info    0.55  1.95  4.99  6.53  5.41  2.57  0.71
SEM          1.34  0.72  0.45  0.39  0.43  0.62  1.18
Reliability -0.80  0.49  0.80  0.85  0.82  0.61 -0.40

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104  and the objective function was  1.85
The number of observations was  1525  with Chi Square =  2801.05  with prob <  0

The root mean square of the residuals (RMSA) is  0.08
The df corrected root mean square of the residuals is  0.09

Tucker Lewis Index of factoring reliability =  0.742
RMSEA index =  0.131  and the 90 % confidence intervals are  0.126 0.135
BIC =  2038.76

7.3 IRT based scoring

The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for score.irt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items/person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The score.irt function applies basic IRT to this problem.

Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects:

> v9 <- sim.irt(9,1000,-2,2,mod="normal")  #dichotomous items
> items <- v9$items
> test <- irt.fa(items)


> om <- omega(iq.irt$rho,4)

[Figure: omega bifactor diagram titled "Omega": the 16 ability items load on a general factor g (loadings about 0.4 to 0.7) and on four group factors corresponding roughly to the rotation, letter, reasoning, and matrix items.]

Figure 21: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when examining the item response functions.


> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> #now delete some of the data - note that they are ordered by score
> items[1:333,5:9] <- NA
> items[334:666,3:7] <- NA
> items[667:1000,1:4] <- NA
> scores <- score.irt(test,items)
> unitweighted <- score.irt(items=items,keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord],scores,unitweighted)
> colnames(scores.df) <- c("True theta","irt theta","total","fit","rasch","total","fit")

These results are seen in Figure 22.

8 Multilevel modeling

Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (over all descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.

8.1 Decomposing data into within and between level correlations using statsBy

There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.

This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:

\[
r_{xy} = \eta_{x_{wg}} \cdot \eta_{y_{wg}} \cdot r_{xy_{wg}} + \eta_{x_{bg}} \cdot \eta_{y_{bg}} \cdot r_{xy_{bg}} \tag{3}
\]


Figure 22: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.
> pairs.panels(scores.df,pch='.',gap=0)
> title('Comparing true theta for IRT, Rasch and classically based scoring',line=3)

[Figure: pairs.panels plot of True theta, irt theta, total, fit, rasch, total, and fit, titled "Comparing true theta for IRT, Rasch and classically based scoring". True theta correlates about 0.78 with the irt theta estimates and about 0.40 with the total (classical) scores.]


where rxy is the normal correlation, which may be decomposed into a within group and between group correlations, rxywg and rxybg, and η (eta) is the correlation of the data with the within group values or the group means.
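A minimal sketch of this decomposition using statsBy on the built-in sat.act data, with education as an illustrative grouping variable:

> sb <- statsBy(sat.act,group="education")
> sb$rwg   #the pooled within group correlations
> sb$rbg   #the between group correlations (of the group means)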

8.2 Generating and displaying multilevel data

withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5, and V8, and V3, V6, and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6, and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5, and V6, and V7, V8, and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.

sim.multilevel will generate simulated data with a multilevel structure.

The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.

Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)), or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).

9 Set Correlation and Multiple Regression from the correlation matrix

An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is

\[
R^2 = 1 - \prod_{i=1}^{n} (1 - \lambda_i)
\]


where λi is the ith eigen value of the eigen value decomposition of the matrix

\[
R = R_{xx}^{-1} R_{xy} R_{yy}^{-1} R_{xy}'.
\]

Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic, based upon the average canonical correlation, might be more appropriate.

setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.

Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example, the analysis is done on the correlation matrix rather than the raw data.

> C <- cov(sat.act,use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)

Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared:  0.0272, Adjusted R-squared:  0.02301
F-statistic: 6.487 on 3 and 696 DF,  p-value: 0.0002476

Compare this with the output from setCor:

> #compare with mat.regress
> setCor(c(4:6),c(1:3),C, n.obs=700)


Call: setCor(y = y, x = x, data = data, z = z, n.obs = n.obs, use = use,
    std = std, square = square, main = main, plot = plot)

Multiple Regression from matrix input

 Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

 Multiple R
 ACT SATV SATQ
0.16 0.10 0.19

 multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

 Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11

 Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

 SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

 t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

 Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

 Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

 Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

 F
 ACT SATV SATQ
6.49 2.26 8.63

 Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05

 degrees of freedom of regression
[1]   3 696

 Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.050 0.033 0.008
Chisq of canonical correlations
[1] 35.8 23.1  5.6

 Average squared canonical correlation =  0.03
 Cohens Set Correlation R2 =  0.09
 Shrunken Set Correlation R2 =  0.08
 F and df of Cohens Set Correlation  7.26 9 1681.86
Unweighted correlation between the two sets =  0.01

Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same independent of the direction of the relationship.

10 Simulation functions

It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.

A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex and sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, sim.anova for ANOVA and lm simulations, and sim.VSS. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.

sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.minor Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.

sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.

Some of these functions are described in more detail in the companion vignette, psych for sem.

The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.

Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.
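A minimal sketch of generating such a structure, assuming the argument names and return value documented in ?sim.simplex:

> S <- sim.simplex(nvar=8,alpha=.8)  #r = .8 between adjacent variables
> round(S[1:4,1:4],2)  #correlations fall off as .8^n with distance n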

Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.
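A sketch of the recovery problem, assuming (as for sim.structure) that the simulated raw data are returned in the observed element of the result:

> set.seed(17)
> sm <- sim.minor(nvar=12,nfact=3,n=500)  #3 major and 6 minor factors
> fa.parallel(sm$observed)  #how many factors do these data suggest?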

Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.
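One hand-rolled way to see the problem (sim.omega automates trials of this sort); the loadings below are illustrative assumptions, not values from the package:

> gload <- matrix(rep(0,3),ncol=1)  #no general factor at all
> fload <- matrix(0,9,3)  #three independent group factors
> fload[1:3,1] <- fload[4:6,2] <- fload[7:9,3] <- c(.8,.7,.6)
> R0 <- sim.hierarchical(gload=gload,fload=fload)  #the model correlation matrix
> omega(R0)  #one factor is nonetheless labelled "general" and omega_h is too large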

The four IRT simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.

The logistic model is

P(x|θ_i, δ_j, γ_j, ζ_j) = γ_j + (ζ_j - γ_j) / (1 + e^{α_j(δ_j - θ_i)})    (4)

where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), α_j is item discrimination and δ_j is item difficulty. For the 1 Parameter Logistic (Rasch) model, γ = 0, ζ = 1, α = 1, and item difficulty is the only free parameter to specify.
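A worked check of Equation 4 for the Rasch case (γ = 0, ζ = 1, α = 1), with an ability one logit above the item difficulty:

> theta <- 1; delta <- 0
> 0 + (1 - 0)/(1 + exp(1*(delta - theta)))  #reduces to 1/(1 + e^-1), about 0.73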

(Graphics of these may be seen in the demonstrations for the logistic function)

The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same. With the a = α parameter = 1.702 in the logistic model, the two models are practically identical.

11 Graphical Displays

Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as by specific diagram functions, e.g. (but not shown):

f3 <- fa(Thurstone,3)
plot(f3)
fa.diagram(f3)
c <- iclust(Thurstone)
plot(c)  #a pretty boring plot
iclust.diagram(c)  #a better diagram
c3 <- iclust(Thurstone,3)
plot(c3)  #a more interesting plot
data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)
ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)

The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in the dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.

> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA,xlim=xlim,ylim=ylim,main="Demonstration of dia functions",axes=FALSE,xlab="",ylab="")
> ul <- dia.rect(1,9,labels="upper left",xlim=xlim,ylim=ylim)
> ll <- dia.rect(1,3,labels="lower left",xlim=xlim,ylim=ylim)
> lr <- dia.ellipse(9,3,"lower right",xlim=xlim,ylim=ylim)
> ur <- dia.ellipse(9,9,"upper right",xlim=xlim,ylim=ylim)
> ml <- dia.ellipse(3,6,"middle left",xlim=xlim,ylim=ylim)
> mr <- dia.ellipse(7,6,"middle right",xlim=xlim,ylim=ylim)
> bl <- dia.ellipse(1,1,"bottom left",xlim=xlim,ylim=ylim)
> br <- dia.rect(9,1,"bottom right",xlim=xlim,ylim=ylim)
> dia.arrow(from=lr,to=ul,labels="right to left")
> dia.arrow(from=ul,to=ur,labels="left to right")
> dia.curved.arrow(from=lr,to=ll$right,labels="right to left")
> dia.curved.arrow(to=ur,from=ul$right,labels="left to right")
> dia.curve(ll$top,ul$bottom,"double")  #for rectangles, specify where to point
> dia.curved.arrow(mr,ur,"up")  #but for ellipses, just point to it
> dia.curve(ml,mr,"across")
> dia.arrow(ur,lr,"top down")
> dia.curved.arrow(br$top,lr$bottom,"up")
> dia.curved.arrow(bl,br,"left to right")
> dia.arrow(bl,ll$bottom)
> dia.curved.arrow(ml,ll$right)
> dia.curved.arrow(mr,lr$top)


Figure 23: The basic graphic capabilities of the dia functions are shown in this figure.

12 Miscellaneous functions

A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.

block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score (a short example follows this list).

geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headtail. Combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r and estimate effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also setCor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.

superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys, or when forming example problems.
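A short demonstration of a few of these helpers; a sketch only, with values easily verified by hand:

> fisherz(.5)  #atanh(.5), about 0.55
> geometric.mean(c(1,10,100))  #the cube root of 1000 = 10
> superMatrix(diag(2),diag(3))  #a 5 x 5 super matrix with 0s off the block diagonal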

13 Data sets

A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941) and were reanalyzed by Bechtoldt (1961). Personality item data representing five personality factors on 25 items (bfi), 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems) are also included. The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.

Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.

iqitems 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton height. peas is the data set Francis Galton used to introduce the correlation coefficient with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1994), this data set allows for examples of basic scaling techniques.

14 Development version and a users guide

The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered ones). To download that version, go to the repository http://personality-project.org/r/src/contrib and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.

News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news:

> news(Version > "1.2.8",package="psych")

15 Psychometric Theory

The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book): An introduction to Psychometric Theory with Applications in R (Revelle, prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular and R in general, consult http://personality-project.org/r/r.guide.html: A short guide to R.

16 SessionInfo

This document was prepared using the following settings:

> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] psych_1.6.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4            lattice_0.20-33        matrixcalc_1.0-3       MASS_7.3-45            grid_3.3.0             arm_1.8-6
 [7] nlme_3.1-127           stats4_3.3.0           coda_0.18-1            minqa_1.2.4            mi_1.0                 nloptr_1.0.4
[13] Matrix_1.2-6           boot_1.3-18            splines_3.3.0          lme4_1.1-12            sem_3.1-7              tools_3.3.0
[19] abind_1.4-3            GPArotation_2014.11-1  parallel_3.3.0         mnormt_1.5-4

References

Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405-432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439-458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447-473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel Modeling in R (2.3): A Brief Introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245-276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78-98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297-334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173-178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England. 122 pp.

Fox, J., Nie, Z., and Byrnes, J. (2013). sem: Structural Equation Models. R package version 3.1-3.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430-450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255-282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121-132.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41-54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179-185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283-300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4):537-549.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231-258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309-317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153-175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676-1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481-495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill, New York, 3rd edition.

Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57-74.

Revelle, W. (2017). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. http://cran.r-project.org/web/packages/psych. R package version 1.7.3.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39-73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403-414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Individual differences in cognition: New methods for examining the personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27-49. Springer, New York, NY.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega and the glb: comments on Sijtsma. Psychometrika, 74(1):145-154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83-90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420-428.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. A Series of Books in Biology. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. A Series of Books in Biology. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72-101.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345-353.

Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components - an alternative to "mathematical factors". Psychological Review, 42(5):425-454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321-327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123-133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121-144.


> png( 'pairs.panels.png' )
> pairs.panels(sat.act,pch='.')
> dev.off()
null device
          1

Figure 1: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. The plot character was set to a period (pch='.') in order to make a cleaner graph.

          edctn  age   ACT  SATV  SATQ
education  1.00
age        0.61  1.00
ACT        0.16  0.15  1.00
SATV       0.02 -0.06  0.61  1.00
SATQ       0.08  0.04  0.60  0.68  1.00

> upper <- lowerCor(female[-1])

          edctn  age   ACT  SATV  SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08  1.00
SATV       0.07 -0.03  0.53  1.00
SATQ       0.03 -0.09  0.58  0.63  1.00

> both <- lowerUpper(lower,upper)
> round(both,2)

          education   age  ACT  SATV  SATQ
education        NA  0.52 0.16  0.07  0.03
age            0.61    NA 0.08 -0.03 -0.09
ACT            0.16  0.15   NA  0.53  0.58
SATV           0.02 -0.06 0.61    NA  0.63
SATQ           0.08  0.04 0.60  0.68    NA

It is also possible to compare two matrices by taking their differences and displaying one (below the diagonal) and the difference of the second from the first above the diagonal:

> diffs <- lowerUpper(lower,upper,diff=TRUE)
> round(diffs,2)

          education   age  ACT  SATV SATQ
education        NA  0.09 0.00 -0.05 0.05
age            0.61    NA 0.07 -0.03 0.13
ACT            0.16  0.15   NA  0.08 0.02
SATV           0.02 -0.06 0.61    NA 0.05
SATQ           0.08  0.04 0.60  0.68   NA

4.4.3 Heatmap displays of correlational structure

Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set which has a clear 3 factor solution (Figure 2) or a simulated data set of 24 variables with a circumplex structure (Figure 3). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.

> png('corplot.png')
> cor.plot(Thurstone,numbers=TRUE,main="9 cognitive variables from Thurstone")
> dev.off()
null device
          1

Figure 2: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option the values are displayed as well.

> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> cor.plot(r.circ,main='24 variables in a circumplex')
> dev.off()
null device
          1

Figure 3: Using the cor.plot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect.

4.5 Polychoric, tetrachoric, polyserial, and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, the φ will underestimate the value of the Pearson applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process. A simple generalization of this to the case of multiple cuts is the polychoric correlation.
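The underestimation is easy to demonstrate by dichotomizing simulated bivariate normal data; a minimal sketch (the latent correlation of .6 is an illustrative assumption):

> set.seed(5)
> x <- rnorm(1000)
> y <- .6*x + rnorm(1000,0,.8)  #the latent correlation is .6
> X <- cbind(x > 0, y > 0) + 0  #cut both variables at their means
> cor(X)[1,2]  #the phi coefficient, noticeably less than .6
> tetrachoric(X)$rho  #recovers a value much closer to .6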

Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlation.

If the data are a mix of continuous, polytomous and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.

The correlation matrix resulting from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will also happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigen values of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
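A hedged sketch of the smoothing step (assuming, as the text suggests, that the burt matrix shipped with psych is not positive semi-definite):

> data(burt)
> min(eigen(burt)$values)  #at least one eigen value should be negative
> burt.s <- cor.smooth(burt)  #a warning reports that smoothing was done
> min(eigen(burt.s)$values)  #now non-negative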

5 Item and scale analysis

The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.

5.1 Dimension reduction through factor analysis and cluster analysis

Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity1. The goal for parsimony is seen in psychometrics as an attempt either to describe (components)

1 Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.

or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs2. At the most simple, a set of items can be thought to represent a random sample from one underlying domain or perhaps a small set of domains. The question for the psychometrician is how many domains are represented and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.

At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores or to estimate (for factors) factor scores.

Several of the functions in psych address the problem of data reduction.

fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will be eventually phased out.

principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values.

factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the sum of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.

vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ2 estimate.

2 Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.

fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data (see the short example after this list).

fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

nfactors A number of different tests for the number of factors problem are run.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.
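A minimal sketch of the number-of-factors helpers applied to the Thurstone correlation matrix (both calls also produce a scree-like plot as a side effect):

> fa.parallel(Thurstone,n.obs=213)  #compare observed eigen values with random data
> vss(Thurstone,n.obs=213)  #Very Simple Structure fits and the MAP criterion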

5.1.1 Minimum Residual Factor Analysis

The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix _nR_n be approximated by the product of a factor matrix _nF_k and its transpose plus a diagonal matrix of uniquenesses?

R = FF' + U^2    (1)

The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls" and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.
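Equation 1 can be checked numerically; for an oblique solution the factor intercorrelations Φ enter the model as R = FΦF' + U^2. A sketch, assuming the f3t solution found in Table 1 below:

> f3t <- fa(Thurstone,3,n.obs=213)
> R.hat <- f3t$loadings %*% f3t$Phi %*% t(f3t$loadings) + diag(f3t$uniquenesses)
> round((Thurstone - R.hat)[1:4,1:4],2)  #the residuals are essentially 0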

factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.

A classic data set collected by Thurstone and Thurstone (1941), then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 1). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ" and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation, although prior versions defaulted to varimax. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).

5.1.2 Principal Axis Factor Analysis

An alternative least squares algorithm, factor.pa (incorporated into fa as an option, fm = "pa"), does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the maxiter parameter to 1 produces the SAS solution.

The solutions from the fa, the factor.minres and factor.pa as well as the principal functions can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations are varimax and quartimax. Oblique transformations include oblimin, quartimin and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.

Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 2).

5.1.3 Weighted Least Squares Factor Analysis

Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller

Table 1: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.
> f3t <- fa(Thurstone,3,n.obs=213)
> f3t
Factor Analysis using method =  minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix

                  MR1   MR2   MR3   h2   u2 com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.0
First.Letters    0.00  0.86  0.00 0.73 0.27 1.0
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

 With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity =  1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90 % confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63

Table 2: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.
> f3 <- fa(Thurstone,3,n.obs = 213,fm="pa")
> f3o <- target.rot(f3)
> f3o
Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix

                  PA1   PA2   PA3   h2   u2
Sentences        0.89 -0.03  0.07 0.81 0.19
Vocabulary       0.89  0.07  0.00 0.80 0.20
Sent.Completion  0.83  0.04  0.03 0.70 0.30
First.Letters   -0.02  0.85 -0.01 0.73 0.27
4.Letter.Words  -0.05  0.74  0.09 0.57 0.43
Suffixes         0.17  0.63 -0.09 0.43 0.57
Letter.Series   -0.06 -0.08  0.84 0.69 0.31
Pedigrees        0.33 -0.09  0.48 0.37 0.63
Letter.Group    -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00

overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 3).

The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 4). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 4) or by a structure diagram (Figure 5).

> plot(f3t)


Figure 4: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.

A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but

Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.
> f3w <- fa(Thurstone,3,n.obs = 213,fm="wls")
> print(f3w,cut=0,digits=3)
Factor Analysis using method =  wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix

                  WLS1   WLS2   WLS3    h2    u2  com
Sentences        0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary       0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion  0.833  0.034  0.007 0.735 0.265 1.00
First.Letters   -0.002  0.855  0.003 0.731 0.269 1.00
4.Letter.Words  -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes         0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series    0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees        0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group    -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

 With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity =  1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90 % confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627

> fa.diagram(f3t)


Figure 5: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.

that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.

5.1.4 Principal Components analysis (PCA)

An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Psychologists typically use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 4 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. (Factor loadings and PCA loadings converge as the number of items increases. Factor loadings may be thought of as the asymptotic principal component loadings as the number of items on the component grows towards infinity.) Also note how the goodness of fit statistics, based upon the residual off diagonal elements, are much worse than for the fa solution.

5.1.5 Hierarchical and bi-factor solutions

For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).

Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 5) can themselves be factored. This results in a higher order factor model (Figure 6). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 7). Yet

Table 4: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 2. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.
> p3p <- principal(Thurstone,3,n.obs = 213,rotate="Promax")
> p3p
Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix

                  RC1   RC2   RC3   h2   u2 com
Sentences        0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary       0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion  0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters    0.01  0.84  0.07 0.78 0.22 1.0
4.Letter.Words  -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes         0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series    0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees        0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group    -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

 With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity =  1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
 with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98

another solution is to use structural equation modeling to directly solve for the general and group factors.

> om.h <- omega(Thurstone,n.obs=213,sl=FALSE)
> op <- par(mfrow=c(1,1))


Figure 6: A higher order factor solution to the Thurstone 9 variable problem.

Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011). Unfortunately, this rotation will frequently achieve a local rather than global optimum, and it is recommended to try multiple original configurations.

> om <- omega(Thurstone,n.obs=213)


Figure 7: A bifactor factor solution to the Thurstone 9 variable problem.

5.1.6 Item Cluster Analysis: iclust

An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930's, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxonometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) reviews why the iclust algorithm is particularly appropriate for scale construction in marketing.

Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):

1. Find the proximity (e.g., correlation) matrix.

2. Identify the most similar pair of items.

3. Combine this most similar pair of items to form a new variable (cluster).

4. Find the similarity of this cluster to all other items and clusters.

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains or, in iclust, if there is a failure to increase reliability coefficients α or β).

6. Purify the solution by reassigning items to the most similar cluster center.

iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori, or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 8).

Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 7). The pattern matrix is available as an object in the results.

> data(bfi)
> ic <- iclust(bfi[1:25])


Figure 8: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.

The previous analysis (Figure 8) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 9). Note that finding the polychoric correlations takes some time the first time, but the next three analyses were done using that correlation matrix (r.poly$rho). When using the console for input, polychoric will report on its progress while working using progressBar.

Table 5: The summary statistics from an iclust analysis shows three large clusters and one smaller cluster.
> summary(ic)  #show the results
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20    C16    C15   C21
C20  0.80 -0.291  0.40 -0.33
C16 -0.24  0.815 -0.29  0.11
C15  0.30 -0.221  0.73 -0.30
C21 -0.23  0.074 -0.20  0.61

The polychoric and tetrachoric functions, when running on OSX, will take advantage of the multiple cores on the machine and run somewhat faster.

Table 6: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.
> data(bfi)
> r.poly <- polychoric(bfi[1:25])  #the ... indicate the progress of the function

A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.

5.2 Confidence intervals using bootstrapping techniques

Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ2 and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 8). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.
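For the polychoric/tetrachoric case, a hedged sketch (fa.poly is assumed to find the polychoric correlations and then resample the raw data; check ?fa.poly for the current argument names):

> fp <- fa.poly(bfi[1:10],2,n.iter=5)  #bootstrapped CIs from polychoric correlations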

5.3 Comparing factor/component/cluster solutions

Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of the congruence coefficient, which is just the cosine of the angle between the dimensions:

c_{f_i f_j} = Σ_{k=1}^{n} f_{ik} f_{jk} / √(Σ f_{ik}^2 Σ f_{jk}^2)
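The coefficient is easily computed directly; the loading vectors below are illustrative values (not from the data), showing that nearly proportional loadings give a congruence near 1:

> f1 <- c(.8,.7,.6)
> f2 <- c(.7,.6,.5)
> sum(f1*f2)/sqrt(sum(f1^2)*sum(f2^2))  #about 0.9997
> factor.congruence(as.matrix(f1),as.matrix(f2))  #the same value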

Consider the case of a four factor solution and four cluster solution to the Big Five problem.

> f4 <- fa(bfi[1:25],4,fm="pa")
> factor.congruence(f4,ic)
      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12

> ic.poly <- iclust(r.poly$rho,title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)


Figure 9: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 8), which was done using Pearson correlations.

> ic.poly <- iclust(r.poly$rho,5,title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)


Figure 10: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 9), which was done without specifying the number of clusters, and to the next one (Figure 11), which was done by changing the beta criterion.


> ic.poly <- iclust(r.poly$rho, beta.size=3, title="ICLUST beta.size=3")


Figure 11: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 8, 9, 10).


Table 7: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is cut=0.
> print(ic, cut=.3)
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
    O   P     C20   C16   C15   C21
A1 C20 C20
A2 C20 C20   0.59
A3 C20 C20   0.65
A4 C20 C20   0.43
A5 C20 C20   0.65
C1 C15 C15               0.54
C2 C15 C15               0.62
C3 C15 C15               0.54
C4 C15 C15         0.31 -0.66
C5 C15 C15  -0.30  0.36 -0.59
E1 C20 C20  -0.50
E2 C20 C20  -0.61  0.34
E3 C20 C20   0.59             -0.39
E4 C20 C20   0.66
E5 C20 C20   0.50        0.40 -0.32
N1 C16 C16         0.76
N2 C16 C16         0.75
N3 C16 C16         0.74
N4 C16 C16  -0.34  0.62
N5 C16 C16         0.55
O1 C20 C21                    -0.53
O2 C21 C21                     0.44
O3 C20 C21   0.39             -0.62
O4 C21 C21                    -0.33
O5 C21 C21                     0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit =  0.68   Pattern fit =  0.96   RMSR =  0.05
NULL


Table 8: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.
> fa(bfi[1:10], 2, n.iter=20)
Factor Analysis with confidence intervals using method = fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method = minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with MLE Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.078 and the 90% confidence intervals are 0.071 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.46 -0.40 -0.36
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.06 -0.03  0.00  0.74  0.77  0.80
A4  0.10  0.15  0.18  0.41  0.44  0.48
A5 -0.02  0.02  0.07  0.58  0.62  0.66
C1  0.52  0.57  0.60 -0.08 -0.05 -0.02
C2  0.59  0.62  0.66 -0.04 -0.01  0.02
C3  0.50  0.54  0.57 -0.01  0.03  0.08
C4 -0.70 -0.66 -0.62 -0.02  0.01  0.03
C5 -0.62 -0.57 -0.53 -0.08 -0.05 -0.01

Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.28     0.32  0.37
>



A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 9).

Table 9: Congruence coefficients for oblique factor, bifactor and component solutions for the Thurstone problem.
> factor.congruence(list(f3t, f3o, om, p3p))

     MR1  MR2  MR3   PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2   RC3
MR1 1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08  0.04
MR2 0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99  0.12
MR3 0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02  0.99
PA1 1.00 0.03 0.01  1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2 0.06 1.00 0.01  0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99  0.05
PA3 0.13 0.06 1.00  0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01  0.99
g   0.72 0.60 0.52  0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58  0.50
F1  1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08  0.04
F2  0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99  0.12
F3  0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02  0.99
h2  0.74 0.57 0.51  0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56  0.49
RC1 1.00 0.04 0.06  1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06  0.00
RC2 0.08 0.99 0.02  0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00  0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05  1.00

5.4 Determining the number of dimensions to extract

How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).

Techniques most commonly used include:

1) Extracting factors until the chi square of the residual matrix is not significant.

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis) fa.parallel (Horn, 1965).

4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).

5) Extracting factors as long as they are interpretable.

6) Using the Very Simple Structure criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.

Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples (N > 1,000) all of the eigen values of random factors will be very close to 1. The scree test is quite appealing, but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigators creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.

An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest that nonetheless account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
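A minimal sketch of this point using sim.minor; the particular sizes (12 variables, 3 major factors, 500 subjects) and the seed are arbitrary choices for illustration:

> set.seed(42)                                # arbitrary seed, for reproducibility
> sim <- sim.minor(nvar=12, nfact=3, n=500)   # 3 major factors plus nvar/2 minor factors
> fa.parallel(sim$observed)                   # does the suggested number exceed the 3 major factors?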

5.4.1 Very Simple Structure

The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Absolute Partial correlation) of Velicer (1976).

Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 12). However, the MAP criterion suggests that 5 is optimal.


> vss <- vss(bfi[1:25], title="Very Simple Structure of a Big 5 inventory")


Figure 12: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.


> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximum of 0.58 with 4 factors
VSS complexity 2 achieves a maximum of 0.74 with 4 factors

The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit RMSEA  BIC   SABIC complex eChisq  SRMR eCRMS  eBIC
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.123 9648 10522.1     1.0  23881 0.119 0.125 21698
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.100 5287  6084.5     1.2  12432 0.086 0.094 10440
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.087 3200  3924.3     1.3   7232 0.066 0.075  5422
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.074 1731  2385.1     1.4   3750 0.047 0.057  2115
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.055  281   869.3     1.6   1495 0.030 0.038    27
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.043 -296   228.5     1.7    670 0.020 0.027  -639
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.037 -463     1.2     1.9    448 0.016 0.023  -711
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.032 -524  -117.6     1.9    289 0.013 0.020  -727

5.4.2 Parallel Analysis

An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 13). It is interesting to compare fa.parallel with paran from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures, with a particular number of major factors but with the presence of trivial minor (but not zero) factors, show that using smcs will tend to lead to too many factors.

A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly or fa.parallel with the cor option="poly". By default, the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
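A minimal sketch of the second form (the 25 bfi items are polytomous, so the polychoric option applies; expect this to run much more slowly than the Pearson version):

> fa.parallel(bfi[1:25], cor="poly")   # parallel analysis based on polychoric correlations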



> fa.parallel(bfi[1:25], main="Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors = 6 and the number of components = 6


Figure 13: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.


5.5 Factor extension

Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 14).

Another way to examine the overlap between two sets is the use of set correlation found by set.cor (discussed later).

6 Classical Test Theory and Reliability

Surprisingly, 112 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).

α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

\lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(\vec{V}_x)}{V_x}\right) = \frac{n}{n-1}\,\frac{V_x - tr(\vec{V}_x)}{V_x} = \alpha

Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. For τ equivalent items, alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy"), coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.

Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, e_j^2, and is

\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum (1 - r_{smc}^2)}{V_x}

The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.
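Since the smc function reports these squared multiple correlations directly, a one line illustration (the choice of the first five bfi items is arbitrary):

> round(smc(cor(bfi[1:5], use="pairwise")), 2)   # one smc per item, a lower bound for its communality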

G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale.


> v16 <- sim.item(16)
> s <- c(1, 3, 5, 7, 9, 11, 13, 15)
> f2 <- fa(v16[,s], 2)
> fe <- fa.extension(cor(v16)[s, -s], f2)
> fa.diagram(f2, fe=fe)


Figure 14: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.


A modification of G6, G6*, takes as an estimate of an item reliability the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.

Alpha, G6 and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.

More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.

Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.

Functions for examining the reliability of a single scale or a set of scales include:

alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item - whole correlations, α if an item is omitted, and item means and standard deviations.

guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zergers (1978) (μ0 ... μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).

omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)

cluster.cor Given an n x c cluster definition matrix of -1s, 0s, and 1s (the keys) and an n x n correlation matrix, find the correlations of the composite clusters.

scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1) and a matrix or data.frame of item scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used and the raw scores are not available. In addition, report Cronbach's alpha coefficient, G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.

score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.

6.1 Reliability of a single scale

A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, and McDonald's ωh and ωt. Consider a simulated data set representing 9 items with a hierarchical structure and the following correlation matrix. Then, using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as when each item is dropped in turn.

> set.seed(17)
> r9 <- sim.hierarchical(n=500, raw=TRUE)$observed
> round(cor(r9), 2)

     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00

> alpha(r9)

Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
  0.77   0.8  0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n rawr std.r r.cor r.drop    mean   sd
V1 500 0.77  0.76  0.76   0.67  0.0327 1.02
V2 500 0.67  0.66  0.62   0.55  0.1071 1.00
V3 500 0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500 0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500 0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500 0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500 0.60  0.60  0.52   0.46  0.0481 1.01
V8 500 0.58  0.57  0.49   0.43  0.0526 1.07
V9 500 0.47  0.48  0.36   0.32  0.0164 0.97

Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that item 2 is to be scored (incorrectly) negatively.

> keys <- c(1, -1, 1, 1, 1, 1, 1)
> alpha(attitude, keys)

Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
       0.43      0.52    0.75      0.14 1.1 0.13   58 5.5

 lower alpha upper     95% confidence boundaries
  0.18  0.43  0.68

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r  S/N alpha se
rating           0.32      0.44    0.67     0.114 0.77    0.170
complaints-      0.80      0.80    0.82     0.394 3.89    0.057
privileges       0.27      0.41    0.72     0.103 0.69    0.176
learning         0.14      0.31    0.64     0.069 0.44    0.207
raises           0.14      0.27    0.61     0.059 0.38    0.205
critical         0.36      0.47    0.76     0.130 0.90    0.139
advance          0.21      0.34    0.66     0.079 0.51    0.171

 Item statistics
             n  rawr std.r r.cor r.drop mean   sd
rating      30  0.60  0.60  0.60   0.33   65 12.2
complaints- 30 -0.57 -0.58 -0.74  -0.74   50 13.3
privileges  30  0.66  0.65  0.54   0.42   53 12.2
learning    30  0.80  0.79  0.78   0.64   56 11.7
raises      30  0.82  0.83  0.85   0.69   65 10.4
critical    30  0.50  0.53  0.35   0.27   75  9.9
advance     30  0.74  0.75  0.71   0.58   43 10.3

Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if item 2 is dropped, then the reliability is improved substantially. This suggests that item 2 was incorrectly scored. Doing the analysis again with item 2 positively scored produces much more favorable results.

> keys <- c(1, 1, 1, 1, 1, 1, 1)
> alpha(attitude, keys)


Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
  0.76  0.84  0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n rawr std.r r.cor r.drop mean   sd
rating     30 0.78  0.76  0.75   0.67   65 12.2
complaints 30 0.84  0.81  0.82   0.74   67 13.3
privileges 30 0.70  0.68  0.60   0.56   53 12.2
learning   30 0.81  0.80  0.78   0.71   56 11.7
raises     30 0.85  0.86  0.85   0.79   65 10.4
critical   30 0.42  0.45  0.31   0.27   75  9.9
advance    30 0.60  0.62  0.56   0.46   43 10.3

It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.

> items <- sim.congeneric(N=500, short=FALSE, low=-2, high=2, categorical=TRUE)  # 500 responses to 4 discrete items
> alpha(items$observed)  # item response analysis of congeneric measures

Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
  0.68  0.72  0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n rawr std.r r.cor r.drop  mean   sd
V1 500 0.81  0.81  0.74   0.62 0.058 0.97
V2 500 0.74  0.75  0.63   0.52 0.012 0.98
V3 500 0.72  0.72  0.57   0.48 0.056 0.99
V4 500 0.68  0.67  0.48   0.41 0.098 1.02

Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0

6.2 Using omega to find the reliability of a single scale

Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 15) or omegaSem for a confirmatory analysis using the sem package (Fox et al., 2013) based upon the exploratory solution from omega.

McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) (http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf) compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009), http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf.)

One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.
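A minimal sketch of these steps done one at a time (omega automates all of them; the 3 factor and oblimin choices are assumptions of this illustration, applied to the simulated r9 data from section 6.1):

> f3 <- fa(r9, nfactors=3, rotate="oblimin")   # oblique exploratory factor analysis
> sl <- schmid(cor(r9), nfactors=3)            # Schmid-Leiman transformation
> round(sl$sl, 2)                              # general (g) and group factor loadings used for omega_h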

ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal). For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively and some negatively. Although it is probably better to specify which directions the items are to be scored by specifying a key vector, if flip=TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)

β, an alternative to ω, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).

The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).

Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.

There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be the √rab, where rab is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. (2007). To do this in omega, add the option="first" or option="second" to the call.
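For instance, on the simulated r9 data (the choice of "first" here is arbitrary, for illustration):

> om2 <- omega(r9, nfactors=2, option="first")   # equate g with the first group factor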

Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.

In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.

McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness u_j^2 from factor analysis to find e_j^2. This is based on a decomposition of the variance of a test score, V_x, into four parts: that due to a general factor, g, that due to a set of group factors, f (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)

Letting

x = cg + Af + Ds + e

then the communality of item j, based upon general as well as group factors, is h_j^2 = c_j^2 + \sum f_{ij}^2, and the unique variance for the item, u_j^2 = \sigma_j^2 (1 - h_j^2), may be used to estimate the test reliability. That is, if h_j^2 is the communality of item j, based upon general as well as group factors, then for standardized items, e_j^2 = 1 - h_j^2 and

\omega_t = \frac{\mathbf{1} cc' \mathbf{1} + \mathbf{1} AA' \mathbf{1}'}{V_x} = 1 - \frac{\sum (1 - h_j^2)}{V_x} = 1 - \frac{\sum u^2}{V_x}


Because h_j^2 \geq r_{smc}^2, \omega_t \geq \lambda_6.

It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor:

\omega_h = \frac{\mathbf{1} cc' \mathbf{1}}{V_x}

Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

\omega_{\infty} = \frac{\mathbf{1} cc' \mathbf{1}}{\mathbf{1} cc' \mathbf{1} + \mathbf{1} AA' \mathbf{1}'}

In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.

> om9

9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36

general/max  4.76   max/min =  1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03


> om9 <- omega(r9, title="9 simulated variables")


Figure 15: A bifactor solution for 9 simulated variables with a hierarchical structure.


RMSEA index =  0.026  and the 90% confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.1  and the 90% confidence intervals are  0.085 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                 g    F1    F2    F3
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g   F1   F2   F3
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

6.3 Estimating ωh using Confirmatory Factor Analysis

The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.

> omegaSem(r9, n.obs=500)

Call: omegaSem(m = r9, n.obs = 500)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36

general/max  4.76   max/min =  1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03
RMSEA index =  0.026  and the 90% confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.1  and the 90% confidence intervals are  0.085 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                 g    F1    F2    F3
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g   F1   F2   F3
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

 Omega Hierarchical from a confirmatory model using sem =  0.72
 Omega Total from a confirmatory model using sem =  0.83
With loadings of
      g   F1   F2   F3   h2   u2
V1 0.74 0.40           0.71 0.29
V2 0.58 0.36           0.47 0.53
V3 0.56 0.42           0.50 0.50
V4 0.57      0.45      0.53 0.47
V5 0.58      0.25      0.40 0.60
V6 0.43      0.26      0.26 0.74
V7 0.49           0.38 0.38 0.62
V8 0.44           0.45 0.39 0.61
V9 0.34           0.24 0.17 0.83

With eigenvalues of:
   g   F1   F2   F3
2.60 0.47 0.40 0.33


6.3.1 Other estimates of reliability

Other estimates of reliability are found by the splitHalf function. These are described in more detail in Revelle and Zinbarg (2009). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.

> splitHalf(r9)

Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) = 0.84
Guttman lambda 6                          = 0.8
Average split half reliability            = 0.79
Guttman lambda 3 (alpha)                  = 0.8
Minimum split half reliability (beta)     = 0.72

6.4 Reliability and correlations of multiple scales within an inventory

A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness and Openness). The data may either be the raw data or a correlation matrix (scoreItems) or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6*.

6.4.1 Scoring from raw data

To score these five scales from the 25 items, use the scoreItems function with the helper function make.keys. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), and -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item (so a response x on an item scored from 1 to 6 becomes 7 - x). The minima and maxima can be estimated from all the items or can be specified by the user.

There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.

The printed output includes coefficients α and G6*, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.

First, create a keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copying it into R.) Although not normally necessary, show the keys to understand what is happening.

Note that the number of items to specify in the make.keys function is the total number of items in the inventory. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.

> keys <- make.keys(nvars=28, list(Agree=c(-1, 2:5), Conscientious=c(6:8, -9, -10),
+     Extraversion=c(-11, -12, 13:15), Neuroticism=c(16:20),
+     Openness = c(21, -22, 23, 24, -25)),
+     item.labels=colnames(bfi))
> keys

          Agree Conscientious Extraversion Neuroticism Openness
A1           -1             0            0           0        0
A2            1             0            0           0        0
A3            1             0            0           0        0
A4            1             0            0           0        0
A5            1             0            0           0        0
C1            0             1            0           0        0
C2            0             1            0           0        0
C3            0             1            0           0        0
C4            0            -1            0           0        0
C5            0            -1            0           0        0
E1            0             0           -1           0        0
E2            0             0           -1           0        0
E3            0             0            1           0        0
E4            0             0            1           0        0
E5            0             0            1           0        0
N1            0             0            0           1        0
N2            0             0            0           1        0
N3            0             0            0           1        0
N4            0             0            0           1        0
N5            0             0            0           1        0
O1            0             0            0           0        1
O2            0             0            0           0       -1
O3            0             0            0           0        1
O4            0             0            0           0        1
O5            0             0            0           0       -1
gender        0             0            0           0        0
education     0             0            0           0        0
age           0             0            0           0        0

The use of multiple key matrices for different inventories is facilitated by using the superMatrix function to combine two or more matrices. This allows convenient scoring of large data sets combining multiple inventories with keys based upon each individual inventory. Pretend for the moment that the big 5 items were made up of two inventories, one consisting of the first 10 items, the second the last 18 items (15 personality items + 3 demographic items). Then the following code would work:

> keys.1 <- make.keys(10, list(Agree=c(-1, 2:5), Conscientious=c(6:8, -9, -10)))
> keys.2 <- make.keys(15, list(Extraversion=c(-1, -2, 3:5), Neuroticism=c(6:10),
+     Openness = c(11, -12, 13, 14, -15)))
> keys.25 <- superMatrix(list(keys.1, keys.2))

The resulting keys matrix is identical to that found above, except that it does not include the extra 3 demographic items. This is useful when scoring the raw items, because the response frequencies for each category are reported, including those for the demographic data.

This use of making multiple key matrices and then combining them into one super matrix of keys is particularly useful when combining demographic information with items to be scored. A set of demographic keys can be made and then these can be combined with the keys for the particular scales.

Now use these keys in combination with the raw data to score the items, calculate basic reliability and intercorrelations, and find the item-by-scale correlations for each item and each scale. By default, missing data are replaced by the median for that variable.

> scores <- scoreItems(keys, bfi)
> scores

Call: scoreItems(keys = keys, items = bfi)

(Unstandardized) Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
alpha   0.7          0.72         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
    Agree Conscientious Extraversion Neuroticism Openness
ASE 0.014         0.014        0.013       0.011    0.017

Average item correlation:
          Agree Conscientious Extraversion Neuroticism Openness
average.r  0.32          0.34         0.39        0.46     0.23

Guttman 6* reliability:
         Agree Conscientious Extraversion Neuroticism Openness
Lambda.6   0.7          0.72         0.76        0.81      0.6

Signal/Noise based upon av.r:
             Agree Conscientious Extraversion Neuroticism Openness
Signal/Noise   2.3           2.6          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, alpha on the diagonal
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.70          0.36         0.63      -0.245     0.23
Conscientious  0.26          0.72         0.35      -0.305     0.30
Extraversion   0.46          0.26         0.76      -0.284     0.32
Neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
Openness       0.15          0.19         0.22      -0.086     0.60

In order to see the item by scale loadings and frequency counts of the data, print with the short option = FALSE.

To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores (see Figure 16).

6.4.2 Forming scales from a correlation matrix

There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.

Consider the same bfi data set, but first find the correlations and then use cluster.cor:

> r.bfi <- cor(bfi, use="pairwise")
> scales <- cluster.cor(keys, r.bfi)
> summary(scales)

Call: cluster.cor(keys = keys, r.mat = r.bfi)

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, (standardized) alpha on the diagonal
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.71          0.35         0.64      -0.242     0.25
Conscientious  0.25          0.73         0.36      -0.287     0.30
Extraversion   0.47          0.27         0.76      -0.278     0.35
Neuroticism   -0.18         -0.22        -0.22       0.815    -0.11
Openness       0.16          0.20         0.24      -0.074     0.61

To find the correlations of the items with each of the scales (the "structure" matrix) or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.


> png('scores.png')
> pairs.panels(scores$scores, pch='.', jiggle=TRUE)
> dev.off()
quartz
     2

Figure 16: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.


6.5 Scoring Multiple Choice Items

Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice, where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).

> data(iqitems)
> iq.keys <- c(4, 4, 4, 6, 6, 3, 4, 4, 5, 2, 2, 4, 3, 2, 6, 7)
> score.multiple.choice(iq.keys, iqitems)

Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

item statistics
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39

> # just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys, iqitems, score=FALSE)
> describe(iq.tf)  # compare to previous results

          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01

Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
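A minimal sketch of such a step (the grouping of the 16 items into two scales here is purely hypothetical, chosen only to show the mechanics):

> iq.keys.2 <- make.keys(16, list(verbal=1:8, spatial=9:16))  # hypothetical two scale grouping
> iq.scales <- scoreItems(iq.keys.2, iq.tf)                   # score the 0/1 item matrix
> summary(iq.scales)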

6.6 Item analysis

Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or for many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or based upon correlation matrices using the set.cor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.

Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 17).

7 Item Response Theory analysis

The use of Item Response Theory is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 18).


> data(iqitems)
> iq.keys <- c(4, 4, 4, 6, 6, 3, 4, 4, 5, 2, 2, 4, 3, 2, 6, 7)
> scores <- score.multiple.choice(iq.keys, iqitems, score=TRUE, short=FALSE)
> # note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  # set this to see the output for multiple items
> irt.responses(scores$scores, iqitems[1:4], breaks=11)


Figure 17: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.


7.1 Factor analysis and Item Response Theory

If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δj, and item discrimination, αj, may be found from factor analysis of the tetrachoric correlations, where λj is just the factor loading on the first factor and τj is the normal threshold reported by the tetrachoric function:

\delta_j = \frac{D \tau_j}{\sqrt{1 - \lambda_j^2}}, \qquad \alpha_j = \frac{\lambda_j}{\sqrt{1 - \lambda_j^2}} \qquad (2)

where D is a scaling factor used when converting to the parameterization of the logistic model, and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λj) and item thresholds (τj) are just

\lambda_j = \frac{\alpha_j}{\sqrt{1 + \alpha_j^2}}, \qquad \tau_j = \frac{\delta_j}{\sqrt{1 + \alpha_j^2}}

Consider 9 dichotomous items representing one factor but differing in their levels of difficulty.

> set.seed(17)
> d9 <- sim.irt(9, 1000, -2.5, 2.5, mod="normal")  # dichotomous items
> test <- irt.fa(d9$items)

> test

Item Response Analysis using Factor Analysis

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis

 Summary information by factor and item
 Factor =  1
              -3    -2    -1     0    1     2     3
V1          0.48  0.59  0.24  0.06 0.01  0.00  0.00
V2          0.30  0.68  0.45  0.12 0.02  0.00  0.00
V3          0.12  0.50  0.74  0.29 0.06  0.01  0.00
V4          0.05  0.26  0.74  0.55 0.14  0.03  0.00
V5          0.01  0.07  0.44  1.05 0.41  0.06  0.01
V6          0.00  0.03  0.15  0.60 0.74  0.24  0.04
V7          0.00  0.01  0.04  0.22 0.73  0.63  0.16
V8          0.00  0.00  0.02  0.12 0.45  0.69  0.31
V9          0.00  0.01  0.02  0.08 0.25  0.47  0.36
Test Info   0.98  2.14  2.85  3.09 2.81  2.14  0.89
SEM         1.01  0.68  0.59  0.57 0.60  0.68  1.06
Reliability -0.02 0.53  0.65  0.68 0.64  0.53 -0.12


Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27 and the objective function was 1.2
The number of observations was 1000 with Chi Square = 1195.58 with prob < 7.3e-235

The root mean square of the residuals (RMSA) is 0.09
The df corrected root mean square of the residuals is 0.1

Tucker Lewis Index of factoring reliability = 0.699
RMSEA index = 0.209 and the 90% confidence intervals are 0.198 0.218
BIC = 1009.07

Similar analyses can be done for polytomous items, such as those of the bfi extraversion scale:

> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis

 Summary information by factor and item
 Factor =  1
              -3    -2    -1     0    1     2    3
E1          0.37  0.58  0.75  0.76 0.61  0.39 0.21
E2          0.48  0.89  1.18  1.24 0.99  0.59 0.25
E3          0.34  0.49  0.58  0.57 0.49  0.36 0.23
E4          0.63  0.98  1.11  0.96 0.67  0.37 0.16
E5          0.33  0.42  0.47  0.45 0.37  0.27 0.17
Test Info   2.15  3.36  4.09  3.98 3.13  1.97 1.02
SEM         0.68  0.55  0.49  0.50 0.56  0.71 0.99
Reliability 0.53  0.70  0.76  0.75 0.68  0.49 0.02

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5 and the objective function was 0.05
The number of observations was 2800 with Chi Square = 135.92 with prob < 1.3e-27

The root mean square of the residuals (RMSA) is 0.04
The df corrected root mean square of the residuals is 0.06

Tucker Lewis Index of factoring reliability = 0.932
RMSEA index = 0.097 and the 90% confidence intervals are 0.083 0.111
BIC = 96.23

The item information functions show that not all of the items are equally good (Figure 19):



> op <- par(mfrow=c(3,1))
> plot(test, type="ICC")
> plot(test, type="IIC")
> plot(test, type="test")
> op <- par(mfrow=c(1,1))


Figure 18: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.


> e.info <- plot(e.irt, type="IIC")


Figure 19: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.


These procedures can be generalized to more than one factor by specifying the number of factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.

gt print(einfosort=TRUE)

Item Response Analysis using Factor Analysis

Summary information by factor and item
Factor = 1
              -3    -2    -1     0     1     2     3
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02

More extensive IRT packages include ltm and eRm, which should be used for serious Item Response Theory analysis.

7.2 Speeding up analyses

Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irtfa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.
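In outline, the pattern looks like this (a minimal sketch, using the objects from the example below; the saved matrix is in the rho component of the first result):

> iq.irt <- irtfa(iq.tf)       #slow: the tetrachoric correlations are found here
> om <- omega(iq.irt$rho,4)    #fast: reuses the saved correlation matrix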

For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 20) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items (Figure 21).

> iq.irt

Item Response Analysis using Factor Analysis

Call: irtfa(x = iq.tf)
Item Response Analysis using Factor Analysis

Summary information by factor and item
Factor = 1
            -3    -2    -1     0     1     2     3
reason4   0.04  0.20  0.59  0.59  0.19  0.04  0.01
reason16  0.07  0.23  0.46  0.39  0.16  0.05  0.01
reason17  0.06  0.27  0.69  0.50  0.14  0.03  0.01
reason19  0.05  0.17  0.41  0.46  0.22  0.07  0.02


> iq.irt <- irtfa(iq.tf)

[Figure: "Item information from factor analysis" -- item information curves for the 16 ability items (reason4, reason16, reason17, reason19, letter7, letter33, letter34, letter58, matrix45, matrix46, matrix47, matrix55, rotate3, rotate4, rotate6, rotate8) plotted against the Latent Trait (normal scale).]

Figure 20: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.


letter7     0.04  0.16  0.44  0.52  0.23  0.06  0.02
letter33    0.04  0.14  0.34  0.43  0.24  0.08  0.02
letter34    0.04  0.17  0.48  0.54  0.22  0.06  0.01
letter58    0.02  0.08  0.28  0.55  0.39  0.13  0.03
matrix45    0.05  0.11  0.23  0.29  0.21  0.10  0.04
matrix46    0.05  0.12  0.25  0.30  0.21  0.10  0.04
matrix47    0.05  0.16  0.37  0.41  0.21  0.07  0.02
matrix55    0.04  0.08  0.14  0.20  0.19  0.13  0.06
rotate3     0.00  0.01  0.07  0.29  0.66  0.45  0.13
rotate4     0.00  0.01  0.05  0.30  0.91  0.55  0.11
rotate6     0.01  0.03  0.14  0.49  0.68  0.27  0.06
rotate8     0.00  0.02  0.08  0.27  0.54  0.39  0.13
Test Info   0.55  1.95  4.99  6.53  5.41  2.57  0.71
SEM         1.34  0.72  0.45  0.39  0.43  0.62  1.18
Reliability -0.80 0.49  0.80  0.85  0.82  0.61 -0.40

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient
The degrees of freedom for the model is 104 and the objective function was 1.85
The number of observations was 1525 with Chi Square = 2801.05 with prob < 0

The root mean square of the residuals (RMSA) is 0.08
The df corrected root mean square of the residuals is 0.09

Tucker Lewis Index of factoring reliability = 0.742
RMSEA index = 0.131 and the 90 % confidence intervals are 0.126 0.135
BIC = 2038.76

7.3 IRT based scoring

The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for score.irt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items/person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The score.irt function applies basic IRT to this problem.
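A minimal sketch, reusing the extraversion example from section 7.1 (the object name e.scores is illustrative):

> e.scores <- score.irt(e.irt,bfi[11:15])   #IRT based scores from the irtfa solution
> headtail(e.scores)                        #show the first and last few subjects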

Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects.

> v9 <- sim.irt(9,1000,-2.,2.,mod="normal")  #dichotomous items
> items <- v9$items
> test <- irtfa(items)


> om <- omega(iq.irt$rho,4)

[Omega diagram: a general factor g (loadings of roughly .4 to .7 on all 16 items) plus four group factors, F1 (the rotate items), F2 (the letter items), F3 (the reason items and matrix47), and F4 (the remaining matrix items).]

Figure 21: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irtfa analysis. The four matrix items have some serious problems, which may be seen later when examining the item response functions.


> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> #now delete some of the data - note that they are ordered by score
> items[1:333,5:9] <- NA
> items[334:666,3:7] <- NA
> items[667:1000,1:4] <- NA
> scores <- score.irt(test,items)
> unitweighted <- score.irt(items=items,keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord],scores,unitweighted)
> colnames(scores.df) <- c("True theta","irt theta","total","fit","rasch","total","fit")

These results are seen in Figure 22.

8 Multilevel modeling

Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (over all descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.

8.1 Decomposing data into within and between level correlations using statsBy

There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.

This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package.

$$r_{xy} = \eta_{x_{wg}} \cdot \eta_{y_{wg}} \cdot r_{xy_{wg}} + \eta_{x_{bg}} \cdot \eta_{y_{bg}} \cdot r_{xy_{bg}} \tag{3}$$


Figure 22: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.

> pairs.panels(scores.df,pch='.',gap=0)
> title("Comparing true theta for IRT, Rasch and classically based scoring",line=3)

[Figure: a pairs.panels scatter plot matrix titled "Comparing true theta for IRT, Rasch and classically based scoring", relating True theta, irt theta, total, fit, rasch, total, and fit. True theta correlates .78 with the IRT based theta and .40 with the total score.]


where $r_{xy}$ is the normal correlation, which may be decomposed into a within group and a between group correlation, $r_{xy_{wg}}$ and $r_{xy_{bg}}$, and $\eta$ (eta) is the correlation of the data with the within group values or the group means.
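A minimal sketch of this decomposition, using the sat.act data grouped by education (the example discussed in the next paragraphs):

> sb <- statsBy(sat.act,"education")
> round(sb$rwg,2)   #the pooled within group correlations
> round(sb$rbg,2)   #the correlations of the group means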

8.2 Generating and displaying multilevel data

withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5 and V8, and V3, V6 and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6 and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5 and V6, and V7, V8 and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.

sim.multilevel will generate simulated data with a multilevel structure.

The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.

Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)) or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).

9 Set Correlation and Multiple Regression from the correlation matrix

An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the set.cor function. Set correlation is

$$R^2 = 1 - \prod_{i=1}^{n} (1 - \lambda_i)$$

where $\lambda_i$ is the ith eigenvalue of the eigenvalue decomposition of the matrix

$$R = R_{xx}^{-1} R_{xy} R_{yy}^{-1} R_{yx}.$$
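Expressed directly in R, the computation is just a product over the eigenvalues (a minimal sketch; setR2 is an illustrative name, not a psych function, and Rxx, Rxy, Ryy are the within- and between-set submatrices of a correlation matrix):

> setR2 <- function(Rxx,Rxy,Ryy) {
+    lambda <- Re(eigen(solve(Rxx) %*% Rxy %*% solve(Ryy) %*% t(Rxy))$values)
+    1 - prod(1 - lambda)   #Cohen's set correlation R2
+ }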

Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic, based upon the average canonical correlation, might be more appropriate.

set.cor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.

Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using set.cor. Two things to notice: set.cor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example, the analysis is done on the correlation matrix rather than the raw data.

> C <- cov(sat.act,use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)

Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared: 0.0272, Adjusted R-squared: 0.02301
F-statistic: 6.487 on 3 and 696 DF, p-value: 0.0002476

Compare this with the output from set.cor:

> #compare with mat.regress
> set.cor(c(4:6),c(1:3),C, n.obs=700)


Call: setCor(y = y, x = x, data = data, z = z, n.obs = n.obs, use = use,
    std = std, square = square, main = main, plot = plot)

Multiple Regression from matrix input

Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19

multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11

Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05

degrees of freedom of regression
[1]   3 696

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.050 0.033 0.008
Chisq of canonical correlations
[1] 35.8 23.1  5.6

Average squared canonical correlation = 0.03
Cohens Set Correlation R2 = 0.09
Shrunken Set Correlation R2 = 0.08
F and df of Cohens Set Correlation 7.26 9 1681.86
Unweighted correlation between the two sets = 0.01

Note that the set.cor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same independent of the direction of the relationship.
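The z argument shown in the Call above allows the relationship between the two sets to be examined while partialling out a third set. A minimal sketch, using the same covariance matrix (the choice of columns is illustrative: relate gender and age to the ACT/SAT set, controlling education):

> set.cor(y=c(4:6),x=c(1,3),z=2,data=C,n.obs=700)   #partial out education (column 2)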

10 Simulation functions

It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.

A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex, and sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item, a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural, a general simulation of structural models, sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.

sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.simple Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.

sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.

Some of these functions are described in more detail in the companion vignette: psych for sem.

The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.

Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.
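The implied pattern may be seen directly (a minimal sketch in base R, with r = .8):

> r <- .8
> round(r^abs(outer(1:6,1:6,"-")),2)   #a 6 variable simplex: r^n for variables n apart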

Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.
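A minimal sketch (the parameter values, and the assumption that the simulated raw data are returned in the observed component, follow the sim.minor defaults):

> set.seed(42)                               #for a reproducible example
> minor <- sim.minor(nvar=12,nfact=3,n=400)  #3 major and 6 minor factors
> fa.parallel(minor$observed)                #how many factors are suggested?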

Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.

The four irt simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.

The logistic model is

$$P(x | \theta_i, \delta_j, \gamma_j, \zeta_j) = \gamma_j + \frac{\zeta_j - \gamma_j}{1 + e^{\alpha_j(\delta_j - \theta_i)}} \tag{4}$$

where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), α_j is item discrimination, and δ_j is item difficulty. For the 1 Parameter Logistic (Rasch) model, gamma=0, zeta=1, alpha=1, and item difficulty is the only free parameter to specify.

(Graphics of these may be seen in the demonstrations for the logistic function.)
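Equation (4) is easily expressed as a function (a minimal sketch; p4pl is an illustrative name, not a psych function):

> p4pl <- function(theta,delta,alpha=1,gamma=0,zeta=1) {
+    gamma + (zeta - gamma)/(1 + exp(alpha*(delta - theta)))
+ }
> p4pl(theta=0,delta=1)   #a Rasch item of difficulty 1: p = .27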

The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters are otherwise the same. With the a = α parameter = 1.702 in the logistic model, the two models are practically identical.
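This practical identity may be checked numerically (assuming the p4pl sketch above):

> theta <- seq(-3,3,.1)
> max(abs(p4pl(theta,delta=0,alpha=1.702) - pnorm(theta)))   #less than .01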

11 Graphical Displays

Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irtfa, plotting the resulting object is done by the plot.psych function as well as specific diagram functions, e.g. (but not shown):

f3 <- fa(Thurstone,3)
plot(f3)
fa.diagram(f3)
c <- iclust(Thurstone)
plot(c)             #a pretty boring plot
iclust.diagram(c)   #a better diagram
c3 <- iclust(Thurstone,3)
plot(c3)            #a more interesting plot
data(bfi)
e.irt <- irtfa(bfi[11:15])
plot(e.irt)
ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)

The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in the dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.


> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA,xlim=xlim,ylim=ylim,main="Demonstration of dia functions",axes=FALSE,xlab="",ylab="")
> ul <- dia.rect(1,9,labels="upper left",xlim=xlim,ylim=ylim)
> ll <- dia.rect(1,3,labels="lower left",xlim=xlim,ylim=ylim)
> lr <- dia.ellipse(9,3,"lower right",xlim=xlim,ylim=ylim)
> ur <- dia.ellipse(9,9,"upper right",xlim=xlim,ylim=ylim)
> ml <- dia.ellipse(3,6,"middle left",xlim=xlim,ylim=ylim)
> mr <- dia.ellipse(7,6,"middle right",xlim=xlim,ylim=ylim)
> bl <- dia.ellipse(1,1,"bottom left",xlim=xlim,ylim=ylim)
> br <- dia.rect(9,1,"bottom right",xlim=xlim,ylim=ylim)
> dia.arrow(from=lr,to=ul,labels="right to left")
> dia.arrow(from=ul,to=ur,labels="left to right")
> dia.curved.arrow(from=lr,to=ll$right,labels ="right to left")
> dia.curved.arrow(to=ur,from=ul$right,labels ="left to right")
> dia.curve(ll$top,ul$bottom,"double")   #for rectangles, specify where to point
> dia.curved.arrow(mr,ur,"up")           #but for ellipses, just point to it
> dia.curve(ml,mr,"across")
> dia.arrow(ur,lr,"top down")
> dia.curved.arrow(br$top,lr$bottom,"up")
> dia.curved.arrow(bl,br,"left to right")
> dia.arrow(bl,ll$bottom)
> dia.curved.arrow(ml,ll$right)
> dia.curved.arrow(mr,lr$top)

[Figure: "Demonstration of dia functions" -- rectangles and ellipses at the positions given in the code above, connected by the labeled straight and curved arrows.]

Figure 23: The basic graphic capabilities of the dia functions are shown in this figure.

12 Miscellaneous functions

A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.

block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score (see the sketch following this list).

geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headtail. Combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r, and estimates effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also set.cor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.

superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys, or when forming example problems.
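Two of these helpers in action (a minimal sketch):

> fisherz(.5)         #Fisher r-to-z: 0.55
> A <- matrix(1,2,2)
> B <- matrix(2,3,3)
> superMatrix(A,B)    #A on the top left, B on the lower right, 0s elsewhere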

13 Data sets

A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941) and reanalyzed by Bechtoldt (1961). Personality item data represent five personality factors on 25 items (bfi) or 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.

Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.


iq 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton height. peas is the data set Francis Galton used to introduce the correlation coefficient, with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1994), this data set allows for examples of basic scaling techniques.

14 Development version and a user's guide

The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered ones). To download that version, go to the repository http://personality-project.org/r/src/contrib and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.

News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news:

> news(Version > "1.2.8",package="psych")


15 Psychometric Theory

The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book): An introduction to Psychometric Theory with Applications in R (Revelle, in prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular and R in general, consult http://personality-project.org/r/r.guide.html: A short guide to R.

16 SessionInfo

This document was prepared using the following settings:

> sessionInfo()

R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] psych_1.6.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4   lattice_0.20-33   matrixcalc_1.0-3   MASS_7.3-45   grid_3.3.0   arm_1.8-6
 [7] nlme_3.1-127  stats4_3.3.0      coda_0.18-1        minqa_1.2.4   mi_1.0       nloptr_1.0.4
[13] Matrix_1.2-6  boot_1.3-18       splines_3.3.0      lme4_1.1-12   sem_3.1-7    tools_3.3.0
[19] abind_1.4-3   GPArotation_2014.11-1   parallel_3.3.0   mnormt_1.5-4


References

Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405–432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439–458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447–473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel Modeling in R (2.3): A Brief Introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245–276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78–98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297–334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173–178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England. 122 pp.

Fox, J., Nie, Z., and Byrnes, J. (2013). sem: Structural Equation Models. R package version 3.1-3.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430–450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255–282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121–132.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41–54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283–300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4):537–549.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231–258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309–317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153–175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676–1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481–495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill, New York, 3rd edition.

Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57–74.

Revelle, W. (2017). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. http://cran.r-project.org/web/packages/psych. R package version 1.7.3.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39–73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403–414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Individual differences in cognition: New methods for examining the personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27–49. Springer, New York, NY.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega and the glb: comments on Sijtsma. Psychometrika, 74(1):145–154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83–90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. A Series of books in biology. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. A Series of books in biology. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345–353.

Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components - an alternative to "mathematical factors". Psychological Review, 42(5):425–454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321–327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123–133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121–144.


Page 12: How To: Use the psych package for Factor Analysis and data

edctn age ACT SATV SATQeducation 100age 061 100ACT 016 015 100SATV 002 -006 061 100SATQ 008 004 060 068 100

gt upper lt- lowerCor(female[-1])

edctn age ACT SATV SATQeducation 100age 052 100ACT 016 008 100SATV 007 -003 053 100SATQ 003 -009 058 063 100

gt both lt- lowerUpper(lowerupper)gt round(both2)

education age ACT SATV SATQeducation NA 052 016 007 003age 061 NA 008 -003 -009ACT 016 015 NA 053 058SATV 002 -006 061 NA 063SATQ 008 004 060 068 NA

It is also possible to compare two matrices by taking their differences and displaying one (be-low the diagonal) and the difference of the second from the first above the diagonal

gt diffs lt- lowerUpper(lowerupperdiff=TRUE)gt round(diffs2)

education age ACT SATV SATQeducation NA 009 000 -005 005age 061 NA 007 -003 013ACT 016 015 NA 008 002SATV 002 -006 061 NA 005SATQ 008 004 060 068 NA

443 Heatmap displays of correlational structure

Perhaps a better way to see the structure in a correlation matrix is to display a heat mapof the correlations This is just a matrix color coded to represent the magnitude of thecorrelation This is useful when considering the number of factors in a data set Considerthe Thurstone data set which has a clear 3 factor solution (Figure 2) or a simulated dataset of 24 variables with a circumplex structure (Figure 3) The color coding represents aldquoheat maprdquo of the correlations with darker shades of red representing stronger negativeand darker shades of blue stronger positive correlations As an option the value of thecorrelation can be shown

12

gt png(corplotpng)gt corplot(Thurstonenumbers=TRUEmain=9 cognitive variables from Thurstone)gt devoff()null device

1

Figure 2 The structure of correlation matrix can be seen more clearly if the variables aregrouped by factor and then the correlations are shown by color By using the rsquonumbersrsquooption the values are displayed as well

13

gt png(circplotpng)gt circ lt- simcirc(24)gt rcirc lt- cor(circ)gt corplot(rcircmain=24 variables in a circumplex)gt devoff()null device

1

Figure 3 Using the corplot function to show the correlations in a circumplex Correlationsare highest near the diagonal diminish to zero further from the diagonal and the increaseagain towards the corners of the matrix Circumplex structures are common in the studyof affect

14

45 Polychoric tetrachoric polyserial and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient If thedata eg ability items are thought to represent an underlying continuous although latentvariable the φ will underestimate the value of the Pearson applied to these latent variablesOne solution to this problem is to use the tetrachoric correlation which is based uponthe assumption of a bivariate normal distribution that has been cut at certain points Thedrawtetra function demonstrates the process A simple generalization of this to the caseof the multiple cuts is the polychoric correlation

Other estimated correlations based upon the assumption of bivariate normality with cutpoints include the biserial and polyserial correlation

If the data are a mix of continuous polytomous and dichotomous variables the mixedcor

function will calculate the appropriate mixture of Pearson polychoric tetrachoric biserialand polyserial correlations

The correlation matrix resulting from a number of tetrachoric or polychoric correlationmatrix sometimes will not be positive semi-definite This will also happen if the correlationmatrix is formed by using pair-wise deletion of cases The corsmooth function will adjustthe smallest eigen values of the correlation matrix to make them positive rescale all ofthem to sum to the number of variables and produce a ldquosmoothedrdquo correlation matrix Anexample of this problem is a data set of burt which probably had a typo in the originalcorrelation matrix Smoothing the matrix corrects this problem

5 Item and scale analysis

The main functions in the psych package are for analyzing the structure of items and ofscales and for finding various estimates of scale reliability These may be considered asproblems of dimension reduction (eg factor analysis cluster analysis principal compo-nents analysis) and of forming and estimating the reliability of the resulting compositescales

51 Dimension reduction through factor analysis and cluster analysis

Parsimony of description has been a goal of science since at least the famous dictumcommonly attributed to William of Ockham to not multiply entities beyond necessity1 Thegoal for parsimony is seen in psychometrics as an attempt either to describe (components)

1Although probably neither original with Ockham nor directly stated by him (Thorburn 1918) Ock-hamrsquos razor remains a fundamental principal of science

15

or to explain (factors) the relationships between many observed variables in terms of amore limited set of components or latent factors

The typical data matrix represents multiple items or scales usually thought to reflect fewerunderlying constructs2 At the most simple a set of items can be be thought to representa random sample from one underlying domain or perhaps a small set of domains Thequestion for the psychometrician is how many domains are represented and how well doeseach item represent the domains Solutions to this problem are examples of factor analysis(FA) principal components analysis (PCA) and cluster analysis (CA) All of these pro-cedures aim to reduce the complexity of the observed data In the case of FA the goal isto identify fewer underlying constructs to explain the observed data In the case of PCAthe goal can be mere data reduction but the interpretation of components is frequentlydone in terms similar to those used when describing the latent variables estimated by FACluster analytic techniques although usually used to partition the subject space ratherthan the variable space can also be used to group variables to reduce the complexity ofthe data by forming fewer and more homogeneous sets of tests or items

At the data level the data reduction problem may be solved as a Singular Value Decom-position of the original matrix although the more typical solution is to find either theprincipal components or factors of the covariance or correlation matrices Given the pat-tern of regression weights from the variables to the components or from the factors to thevariables it is then possible to find (for components) individual component or cluster scoresor estimate (for factors) factor scores

Several of the functions in psych address the problem of data reduction

fa incorporates five alternative algorithms minres factor analysis principal axis factoranalysis weighted least squares factor analysis generalized least squares factor anal-ysis and maximum likelihood factor analysis That is it includes the functionality ofthree other functions that will be eventually phased out

principal Principal Components Analysis reports the largest n eigen vectors rescaled bythe square root of their eigen values

factorcongruence The congruence between two factors is the cosine of the angle betweenthem This is just the cross products of the loadings divided by the sum of the squaredloadings This differs from the correlation coefficient in that the mean loading is notsubtracted before taking the products factorcongruence will find the cosinesbetween two (or more) sets of factor loadings

vss Very Simple Structure Revelle and Rocklin (1979) applies a goodness of fit test todetermine the optimal number of factors to extract It can be thought of as a quasi-

2Cattell (1978) as well as MacCallum et al (2007) argue that the data are the result of many morefactors than observed variables but are willing to estimate the major underlying factors

16

confirmatory model in that it fits the very simple structure (all except the biggest cloadings per item are set to zero where c is the level of complexity of the item) of afactor pattern matrix to the original correlation matrix For items where the model isusually of complexity one this is equivalent to making all except the largest loadingfor each item 0 This is typically the solution that the user wants to interpret Theanalysis includes the MAP criterion of Velicer (1976) and a χ2 estimate

faparallel The parallel factors technique compares the observed eigen values of a cor-relation matrix with those from random data

faplot will plot the loadings from a factor principal components or cluster analysis(just a call to plot will suffice) If there are more than two factors then a SPLOMof the loadings is generated

nfactors A number of different tests for the number of factors problem are run

fadiagram replaces fagraph and will draw a path diagram representing the factor struc-ture It does not require Rgraphviz and thus is probably preferred

fagraph requires Rgraphviz and will draw a graphic representation of the factor struc-ture If factors are correlated this will be represented as well

iclust is meant to do item cluster analysis using a hierarchical clustering algorithmspecifically asking questions about the reliability of the clusters (Revelle 1979) Clus-ters are formed until either coefficient α Cronbach (1951) or β Revelle (1979) fail toincrease

511 Minimum Residual Factor Analysis

The factor model is an approximation of a correlation matrix by a matrix of lower rankThat is can the correlation matrix nRn be approximated by the product of a factor matrix

nFk and its transpose plus a diagonal matrix of uniqueness

R = FF prime+U2 (1)

The maximum likelihood solution to this equation is found by factanal in the stats pack-age Five alternatives are provided in psych all of them are included in the fa functionand are called by specifying the factor method (eg fm=ldquominresrdquo fm=ldquopardquo fm=ldquordquowlsrdquofm=rdquoglsrdquo and fm=rdquomlrdquo) In the discussion of the other algorithms the calls shown are tothe fa function specifying the appropriate method

factorminres attempts to minimize the off diagonal residual correlation matrix by ad-justing the eigen values of the original correlation matrix This is similar to what is done

17

in factanal but uses an ordinary least squares instead of a maximum likelihood fit func-tion The solutions tend to be more similar to the MLE solutions than are the factorpa

solutions minres is the default for the fa function

A classic data set collected by Thurstone and Thurstone (1941) and then reanalyzed byBechtoldt (1961) and discussed by McDonald (1999) is a set of 9 cognitive variables witha clear bi-factor structure Holzinger and Swineford (1937) The minimum residual solutionwas transformed into an oblique solution using the default option on rotate which usesan oblimin transformation (Table 1) Alternative rotations and transformations includeldquononerdquo ldquovarimaxrdquo ldquoquartimaxrdquo ldquobentlerTrdquo and ldquogeominTrdquo (all of which are orthogonalrotations) as well as ldquopromaxrdquo ldquoobliminrdquo ldquosimplimaxrdquo ldquobentlerQ andldquogeominQrdquo andldquoclusterrdquo which are possible oblique transformations of the solution The default is to doa oblimin transformation although prior versions defaulted to varimax The measures offactor adequacy reflect the multiple correlations of the factors with the best fitting linearregression estimates of the factor scores (Grice 2001)

512 Principal Axis Factor Analysis

An alternative least squares algorithm factorpa (incorporated into fa as an option (fm= ldquopardquo) does a Principal Axis factor analysis by iteratively doing an eigen value decompo-sition of the correlation matrix with the diagonal replaced by the values estimated by thefactors of the previous iteration This OLS solution is not as sensitive to improper matri-ces as is the maximum likelihood method and will sometimes produce more interpretableresults It seems as if the SAS example for PA uses only one iteration Setting the maxiterparameter to 1 produces the SAS solution

The solutions from the fa the factorminres and factorpa as well as the principal

functions can be rotated or transformed with a number of options Some of these callthe GPArotation package Orthogonal rotations are varimax and quartimax Obliquetransformations include oblimin quartimin and then two targeted rotation functionsPromax and targetrot The latter of these will transform a loadings matrix towardsan arbitrary target matrix The default is to transform towards an independent clustersolution

Using the Thurstone data set three factors were requested and then transformed into anindependent clusters solution using targetrot (Table 2)

513 Weighted Least Squares Factor Analysis

Similar to the minres approach of minimizing the squared residuals factor method ldquowlsrdquoweights the squared residuals by their uniquenesses This tends to produce slightly smaller

18

Table 1 Three correlated factors from the Thurstone 9 variable problem By default thesolution is transformed obliquely using oblimin The extraction method is (by default)minimum residualgt f3t lt- fa(Thurstone3nobs=213)gt f3tFactor Analysis using method = minresCall fa(r = Thurstone nfactors = 3 nobs = 213)Standardized loadings (pattern matrix) based upon correlation matrix

MR1 MR2 MR3 h2 u2 comSentences 091 -004 004 082 018 10Vocabulary 089 006 -003 084 016 10SentCompletion 083 004 000 073 027 10FirstLetters 000 086 000 073 027 104LetterWords -001 074 010 063 037 10Suffixes 018 063 -008 050 050 12LetterSeries 003 -001 084 072 028 10Pedigrees 037 -005 047 050 050 19LetterGroup -006 021 064 053 047 12

MR1 MR2 MR3SS loadings 264 186 150Proportion Var 029 021 017Cumulative Var 029 050 067Proportion Explained 044 031 025Cumulative Proportion 044 075 100

With factor correlations ofMR1 MR2 MR3

MR1 100 059 054MR2 059 100 052MR3 054 052 100

Mean item complexity = 12Test of the hypothesis that 3 factors are sufficient

The degrees of freedom for the null model are 36 and the objective function was 52 with Chi Square of 108197The degrees of freedom for the model are 12 and the objective function was 001

The root mean square of the residuals (RMSR) is 001The df corrected root mean square of the residuals is 001

The harmonic number of observations is 213 with the empirical chi square 058 with prob lt 1The total number of observations was 213 with MLE Chi Square = 282 with prob lt 1

Tucker Lewis Index of factoring reliability = 1027RMSEA index = 0 and the 90 confidence intervals are NA NABIC = -6151Fit based upon off diagonal values = 1Measures of factor score adequacy

MR1 MR2 MR3Correlation of scores with factors 096 092 090Multiple R square of scores with factors 093 085 081Minimum correlation of possible factor scores 086 071 063

19

Table 2 The 9 variable problem from Thurstone is a classic example of factoring wherethere is a higher order factor g that accounts for the correlation between the factors Theextraction method was principal axis The transformation was a targeted transformationto a simple cluster solutiongt f3 lt- fa(Thurstone3nobs = 213fm=pa)gt f3o lt- targetrot(f3)gt f3oCall NULLStandardized loadings (pattern matrix) based upon correlation matrix

PA1 PA2 PA3 h2 u2Sentences 089 -003 007 081 019Vocabulary 089 007 000 080 020SentCompletion 083 004 003 070 030FirstLetters -002 085 -001 073 0274LetterWords -005 074 009 057 043Suffixes 017 063 -009 043 057LetterSeries -006 -008 084 069 031Pedigrees 033 -009 048 037 063LetterGroup -014 016 064 045 055

PA1 PA2 PA3SS loadings 245 172 137Proportion Var 027 019 015Cumulative Var 027 046 062Proportion Explained 044 031 025Cumulative Proportion 044 075 100

PA1 PA2 PA3PA1 100 002 008PA2 002 100 009PA3 008 009 100

20

overall residuals In the example of weighted least squares the output is shown by using theprint function with the cut option set to 0 That is all loadings are shown (Table 3)

The unweighted least squares solution may be shown graphically using the faplot functionwhich is called by the generic plot function (Figure 4 Factors were transformed obliquelyusing a oblimin These solutions may be shown as item by factor plots (Figure 4 or by astructure diagram (Figure 5

gt plot(f3t)

MR1

00 02 04 06 08

00

02

04

06

08

00

02

04

06

08

MR2

00 02 04 06 08

00 02 04 06 08

00

02

04

06

08

MR3

Factor Analysis

Figure 4 A graphic representation of the 3 oblique factors from the Thurstone data usingplot Factors were transformed to an oblique solution using the oblimin function from theGPArotation package

A comparison of these three approaches suggests that the minres solution is more similarto a maximum likelihood solution and fits slightly better than the pa or wls solutionsComparisons with SPSS suggest that the pa solution matches the SPSS OLS solution but

21

Table 3 The 9 variable problem from Thurstone is a classic example of factoring wherethere is a higher order factor g that accounts for the correlation between the factors Thefactors were extracted using a weighted least squares algorithm All loadings are shown byusing the cut=0 option in the printpsych functiongt f3w lt- fa(Thurstone3nobs = 213fm=wls)gt print(f3wcut=0digits=3)Factor Analysis using method = wlsCall fa(r = Thurstone nfactors = 3 nobs = 213 fm = wls)Standardized loadings (pattern matrix) based upon correlation matrix

WLS1 WLS2 WLS3 h2 u2 comSentences 0905 -0034 0040 0822 0178 101Vocabulary 0890 0066 -0031 0835 0165 101SentCompletion 0833 0034 0007 0735 0265 100FirstLetters -0002 0855 0003 0731 0269 1004LetterWords -0016 0743 0106 0629 0371 104Suffixes 0180 0626 -0082 0496 0504 120LetterSeries 0033 -0015 0838 0719 0281 100Pedigrees 0381 -0051 0464 0505 0495 195LetterGroup -0062 0209 0632 0527 0473 124

WLS1 WLS2 WLS3SS loadings 2647 1864 1488Proportion Var 0294 0207 0165Cumulative Var 0294 0501 0667Proportion Explained 0441 0311 0248Cumulative Proportion 0441 0752 1000

With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90 % confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy

                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627


> fa.diagram(f3t)

[Figure 5 image: fa.diagram path diagram titled Factor Analysis; Sentences, Vocabulary, and Sent.Completion load on MR1 (0.9, 0.9, 0.8); First.Letters, 4.Letter.Words, and Suffixes load on MR2 (0.9, 0.7, 0.6); Letter.Series, Letter.Group, and Pedigrees load on MR3 (0.8, 0.6, 0.5); the factors are connected by intercorrelations of about 0.5 to 0.6.]

Figure 5: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.


that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.
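A minimal sketch of requesting that normalization (this assumes the argument is named normalize, as in the GPArotation routines that psych calls for its rotations):

> f3n <- fa(Thurstone,3,n.obs=213,rotate="oblimin",normalize=TRUE)  #Kaiser normalize before rotating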

5.1.4 Principal Components analysis (PCA)

An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Psychologists typically use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 4 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. (Factor loadings and PCA loadings converge as the number of items increases: factor loadings may be thought of as the asymptotic principal component loadings as the number of items on the component grows towards infinity.) Also note how the goodness of fit statistics, based upon the residual off diagonal elements, is much worse than the fa solution.
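The convergence of the two models may be checked directly with the factor.congruence function; a minimal sketch (the object names p3 and f3 are illustrative):

> p3 <- principal(Thurstone,3,n.obs=213,rotate="Promax")  #the component solution of Table 4
> f3 <- fa(Thurstone,3,n.obs=213)                         #the minres factor solution
> factor.congruence(f3,p3)  #cosines between the factor and component dimensions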

5.1.5 Hierarchical and bi-factor solutions

For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general, factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).

Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 5) can themselves be factored. This results in a higher order factor model (Figure 6). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 7). Yet


Table 4: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 2. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.
> p3p <- principal(Thurstone,3,n.obs = 213,rotate="Promax")
> p3p
Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix

                  RC1   RC2   RC3   h2   u2 com
Sentences        0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary       0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion  0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters    0.01  0.84  0.07 0.78 0.22 1.0
4.Letter.Words  -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes         0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series    0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees        0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group    -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
 with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98


another solution is to use structural equation modeling to directly solve for the general and group factors.
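A minimal sketch of this last approach, using the omegaSem function that is described more fully in section 6.3:

> om.sem <- omegaSem(Thurstone,n.obs=213)  #exploratory omega followed by a confirmatory sem model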

> om.h <- omega(Thurstone,n.obs=213,sl=FALSE)
> op <- par(mfrow=c(1,1))

[Figure 6 image: higher order factor diagram titled Omega; the nine Thurstone variables load on the three group factors F1, F2, and F3, which in turn load 0.8, 0.8, and 0.7 on the higher order factor g.]

Figure 6: A higher order factor solution to the Thurstone 9 variable problem.

Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011). Unfortunately, this rotation will frequently achieve a local rather than global optimum, and it is recommended to try multiple original configurations.
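A minimal sketch (this assumes extracting four factors, one general plus three group factors, and that rotate="bifactor" is available in fa):

> fb <- fa(Thurstone,4,n.obs=213,rotate="bifactor")  #direct Jennrich-Bentler bifactor rotation

Given the local optimum problem, it is worth repeating such a rotation from several different starting configurations and keeping the best fitting solution.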


> om <- omega(Thurstone,n.obs=213)

[Figure 7 image: bifactor (Schmid-Leiman) diagram titled Omega; each of the nine Thurstone variables loads on the general factor g (roughly 0.5 to 0.7) as well as on one of the residualized group factors F1, F2, or F3 (roughly 0.2 to 0.6).]

Figure 7: A bifactor solution to the Thurstone 9 variable problem.


5.1.6 Item Cluster Analysis: iclust

An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by e.g. (Tryon, 1935; Loevinger et al., 1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) reviews why the iclust algorithm is particularly appropriate for scale construction in marketing.

Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):

1. Find the proximity (e.g. correlation) matrix,

2. Identify the most similar pair of items,

3. Combine this most similar pair of items to form a new variable (cluster),

4. Find the similarity of this cluster to all other items and clusters,

5. Repeat steps 2 and 3 until some criterion is reached (e.g. typically, if only one cluster remains, or in iclust if there is a failure to increase reliability coefficients α or β),

6. Purify the solution by reassigning items to the most similar cluster center.

iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori, or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 8).


Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 7). The pattern matrix is available as an object in the results.

> data(bfi)
> ic <- iclust(bfi[1:25])

[Figure 8 image: ICLUST dendrogram of the 25 bfi items; the items combine into four major clusters: C20 (α = 0.81, β = 0.63, the extraversion and agreeableness items), C16 (α = 0.81, β = 0.76, the neuroticism items), C15 (α = 0.73, β = 0.67, the conscientiousness items), and C21 (α = 0.41, β = 0.27, the openness items).]

Figure 8: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.

The previous analysis (Figure 8) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 9). Note that finding the polychoric correlations takes some time the first time, but the next three analyses were done using that correlation matrix (rpoly$rho). When using the console for input, polychoric will report on its progress while working using


Table 5: The summary statistics from an iclust analysis shows three large clusters and a smaller cluster.
> summary(ic)  #show the results
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61


progressBar. polychoric and tetrachoric, when running on OSX, will take advantage of the multiple cores on the machine and run somewhat faster.

Table 6: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.
> data(bfi)
> rpoly <- polychoric(bfi[1:25])  #the ... indicate the progress of the function

A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.

5.2 Confidence intervals using bootstrapping techniques

Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ2 and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 8). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.
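A minimal sketch (the second line assumes the cor="poly" option of fa for factoring polychoric correlations, as in recent versions of psych):

> fb  <- fa(bfi[1:10],2,n.iter=20)            #bootstrapped CIs from Pearson correlations
> fbp <- fa(bfi[1:10],2,n.iter=20,cor="poly") #the same analysis based upon polychoric correlations (slower)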

5.3 Comparing factor/component/cluster solutions

Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of the congruence coefficient, which is just the cosine of the angle between the dimensions:

c_{f_i f_j} = \frac{\sum_{k=1}^{n} f_{ik} f_{jk}}{\sqrt{\sum f_{ik}^2 \sum f_{jk}^2}}

Consider the case of a four factor solution and four cluster solution to the Big Five problem.

> f4 <- fa(bfi[1:25],4,fm="pa")
> factor.congruence(f4,ic)

      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12


> ic.poly <- iclust(rpoly$rho,title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)

[Figure 9 image: ICLUST dendrogram titled ICLUST using polychoric correlations; all 25 items eventually join in a general cluster C23 (α = 0.83, β = 0.58), formed from C22 (α = 0.81, β = 0.29) and the large clusters C20 (α = 0.83, β = 0.66), C19 (α = 0.8, β = 0.66), C16 (α = 0.84, β = 0.79), and C15 (α = 0.77, β = 0.71).]

Figure 9: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 8), which was done using Pearson correlations.


> ic.poly <- iclust(rpoly$rho,5,title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)

[Figure 10 image: ICLUST dendrogram titled ICLUST using polychoric correlations for nclusters=5; the five clusters are C20 (α = 0.83, β = 0.66), C19 (α = 0.8, β = 0.66), C16 (α = 0.84, β = 0.79), C15 (α = 0.77, β = 0.71), and C7 (O2 and O5), with O4 left unassigned.]

Figure 10: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 9), which was done without specifying the number of clusters, and to the next one (Figure 11), which was done by changing the beta criterion.


> ic.poly <- iclust(rpoly$rho,beta.size=3,title="ICLUST, beta.size=3")

[Figure 11 image: ICLUST dendrogram titled ICLUST, beta.size=3; with the beta criterion set to 3 the solution again forms a general cluster C23 (α = 0.83, β = 0.58), essentially reproducing the structure of Figure 9.]

Figure 11: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 8, 9, 10).


Table 7: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.
> print(ic,cut=.3)
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
    O   P     C20   C16   C15   C21
A1  C20 C20
A2  C20 C20  0.59
A3  C20 C20  0.65
A4  C20 C20  0.43
A5  C20 C20  0.65
C1  C15 C15               0.54
C2  C15 C15               0.62
C3  C15 C15               0.54
C4  C15 C15         0.31 -0.66
C5  C15 C15 -0.30   0.36 -0.59
E1  C20 C20 -0.50
E2  C20 C20 -0.61   0.34
E3  C20 C20  0.59               -0.39
E4  C20 C20  0.66
E5  C20 C20  0.50         0.40  -0.32
N1  C16 C16         0.76
N2  C16 C16         0.75
N3  C16 C16         0.74
N4  C16 C16 -0.34   0.62
N5  C16 C16         0.55
O1  C20 C21                     -0.53
O2  C21 C21                      0.44
O3  C20 C21  0.39               -0.62
O4  C21 C21                     -0.33
O5  C21 C21                      0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit =  0.68   Pattern fit =  0.96   RMSR =  0.05
NULL


Table 8: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.
> fa(bfi[1:10],2,n.iter=20)
Factor Analysis with confidence intervals using method =  fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method =  minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix

     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with MLE Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.078 and the 90 % confidence intervals are 0.071 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy

                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.46 -0.40 -0.36
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.06 -0.03  0.00  0.74  0.77  0.80
A4  0.10  0.15  0.18  0.41  0.44  0.48
A5 -0.02  0.02  0.07  0.58  0.62  0.66
C1  0.52  0.57  0.60 -0.08 -0.05 -0.02
C2  0.59  0.62  0.66 -0.04 -0.01  0.02
C3  0.50  0.54  0.57 -0.01  0.03  0.08
C4 -0.70 -0.66 -0.62 -0.02  0.01  0.03
C5 -0.62 -0.57 -0.53 -0.08 -0.05 -0.01

Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.28     0.32  0.37
>


PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90

A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 9).

Table 9: Congruence coefficients for oblique factor, bifactor and component solutions for the Thurstone problem.
> factor.congruence(list(f3t,f3o,om,p3p))

     MR1  MR2  MR3  PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2   RC3
MR1 1.00 0.06 0.09 1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08  0.04
MR2 0.06 1.00 0.08 0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99  0.12
MR3 0.09 0.08 1.00 0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02  0.99
PA1 1.00 0.03 0.01 1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2 0.06 1.00 0.01 0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99  0.05
PA3 0.13 0.06 1.00 0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01  0.99
g   0.72 0.60 0.52 0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58  0.50
F1  1.00 0.06 0.09 1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08  0.04
F2  0.06 1.00 0.08 0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99  0.12
F3  0.09 0.08 1.00 0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02  0.99
h2  0.74 0.57 0.51 0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56  0.49
RC1 1.00 0.04 0.06 1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06  0.00
RC2 0.08 0.99 0.02 0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00  0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05 1.00

5.4 Determining the number of dimensions to extract

How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" Horn and Engstrom (1979).

Techniques most commonly used include:

1) Extracting factors until the chi square of the residual matrix is not significant

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis): fa.parallel (Horn, 1965); see the sketch following this list.

4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).

5) Extracting factors as long as they are interpretable

6) Using the Very Simple Structure criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.
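Several of these criteria are implemented directly in psych; a minimal sketch, using functions demonstrated later in this section:

> fa.parallel(bfi[1:25])  #parallel analysis (criterion 3), which also draws the scree (criterion 4)
> vss(bfi[1:25])          #Very Simple Structure (criterion 6) and Velicer's MAP (criterion 7)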

Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples (N > 1,000) all of the eigen values of random factors will be very close to 1. The scree test is quite appealing but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigator's creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.

An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest but that there are also numerous minor factors of no substantive interest that nonetheless account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
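A minimal sketch of exploring this problem (this assumes, as with the other psych simulation functions, that sim.minor returns the implied correlation matrix in its model component):

> set.seed(42)
> minor <- sim.minor(nvar=24,nfact=4)  #4 major factors plus many trivial minor factors
> fa.parallel(minor$model,n.obs=400)   #how many factors does parallel analysis now suggest?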

5.4.1 Very Simple Structure

The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity Revelle and Rocklin (1979). Included in vss is the MAP criterion (Minimum Absolute Partial correlation) of Velicer (1976).

Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 12). However, the MAP criterion suggests that 5 is optimal.


> vss <- vss(bfi[1:25],title="Very Simple Structure of a Big 5 inventory")

[Figure 12 image: plot titled Very Simple Structure of a Big 5 inventory; the x axis is the Number of Factors (1 to 8) and the y axis is the Very Simple Structure Fit (0.0 to 1.0); separate lines show the fit for item complexities 1 through 4, with the complexity 1 and 2 lines peaking at four factors.]

Figure 12: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.


> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximimum of 0.58 with 4 factors
VSS complexity 2 achieves a maximimum of 0.74 with 4 factors

The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit RMSEA   BIC   SABIC complex eChisq  SRMR eCRMS   eBIC
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.123  9648 10522.1     1.0  23881 0.119 0.125  21698
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.100  5287  6084.5     1.2  12432 0.086 0.094  10440
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.087  3200  3924.3     1.3   7232 0.066 0.075   5422
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.074  1731  2385.1     1.4   3750 0.047 0.057   2115
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.055   281   869.3     1.6   1495 0.030 0.038     27
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.043  -296   228.5     1.7    670 0.020 0.027   -639
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.037  -463     1.2     1.9    448 0.016 0.023   -711
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.032  -524  -117.6     1.9    289 0.013 0.020   -727

5.4.2 Parallel Analysis

An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 13). It is interesting to compare fa.parallel with the paran from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial minor (but not zero) factors, show that using smcs will tend to lead to too many factors.

A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly or fa.parallel with the cor option="poly". By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
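A minimal sketch:

> fa.parallel(bfi[1:25],cor="poly")  #parallel analysis based upon polychoric correlations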

5.5 Factor extension

Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by (Dwyer, 1937). Consider the case of 16 variables created


> fa.parallel(bfi[1:25],main="Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors = 6 and the number of components = 6

[Figure 13 image: plot titled Parallel Analysis of a Big 5 inventory; the x axis is the Factor/Component Number (1 to 25) and the y axis shows the eigen values of principal components and factor analysis; the legend distinguishes PC and FA eigen values for the actual, simulated, and resampled data.]

Figure 13: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.


to represent a two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 14).

Another way to examine the overlap between two sets is the use of set correlation, found by set.cor (discussed later).

6 Classical Test Theory and Reliability

Surprisingly, 112 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and over estimates the first factor saturation (Revelle and Zinbarg, 2009).

α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

\lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(\vec{V})_x}{V_x}\right) = \frac{n}{n-1}\,\frac{V_x - tr(\vec{V}_x)}{V_x} = \alpha

Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. For τ equivalent items, alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy"), coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.

Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, e_j^2, and is

\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum (1 - r_{smc}^2)}{V_x}

The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.

G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability


> v16 <- sim.item(16)
> s <- c(1,3,5,7,9,11,13,15)
> f2 <- fa(v16[,s],2)
> fe <- fa.extension(cor(v16)[s,-s],f2)
> fa.diagram(f2,fe=fe)

[Figure 14 image: diagram titled Factor analysis and extension; the odd numbered variables (V1 to V15) define the two factors MR1 and MR2, with loadings of about ±0.5 to ±0.7, and the even numbered variables (V2 to V16) are shown as extensions of those factors.]

Figure 14: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.


the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.

Alpha, G6 and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.

More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.

Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.

Functions for examining the reliability of a single scale or a set of scales include:

alpha  Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item-whole correlations, α if an item is omitted, and item means and standard deviations.

guttman  Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zergers (1978) (μ0 ... μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).

omega  Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)

cluster.cor  Given an n x c cluster definition matrix of -1s, 0s, and 1s (the keys), and an n x n correlation matrix, find the correlations of the composite clusters.

scoreItems  Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of items scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha, coefficient G6, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.

score.multiple.choice  Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.

6.1 Reliability of a single scale

A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, and McDonald's ωh and ωt. Consider a simulated data set representing 9 items with a hierarchical structure and the following correlation matrix. Then, using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as the estimates if one item is dropped at a time.

> set.seed(17)
> r9 <- sim.hierarchical(n=500,raw=TRUE)$observed
> round(cor(r9),2)

     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00

> alpha(r9)

Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
  0.77   0.8  0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n rawr std.r rcor r.drop    mean   sd
V1 500 0.77  0.76 0.76   0.67  0.0327 1.02
V2 500 0.67  0.66 0.62   0.55  0.1071 1.00
V3 500 0.65  0.65 0.61   0.53 -0.0462 1.01
V4 500 0.65  0.65 0.59   0.53 -0.0091 0.98
V5 500 0.64  0.64 0.58   0.52 -0.0012 1.03
V6 500 0.55  0.55 0.46   0.41 -0.0499 0.99
V7 500 0.60  0.60 0.52   0.46  0.0481 1.01
V8 500 0.58  0.57 0.49   0.43  0.0526 1.07
V9 500 0.47  0.48 0.36   0.32  0.0164 0.97

Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that item 2 is to be scored (incorrectly) negatively.

> keys <- c(1,-1,1,1,1,1,1)
> alpha(attitude,keys)

Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
       0.43      0.52    0.75      0.14 1.1 0.13   58 5.5

 lower alpha upper     95% confidence boundaries
  0.18  0.43  0.68

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r  S/N alpha se
rating           0.32      0.44    0.67     0.114 0.77    0.170
complaints-      0.80      0.80    0.82     0.394 3.89    0.057
privileges       0.27      0.41    0.72     0.103 0.69    0.176
learning         0.14      0.31    0.64     0.069 0.44    0.207
raises           0.14      0.27    0.61     0.059 0.38    0.205
critical         0.36      0.47    0.76     0.130 0.90    0.139
advance          0.21      0.34    0.66     0.079 0.51    0.171

 Item statistics
             n  rawr std.r  rcor r.drop mean   sd
rating      30  0.60  0.60  0.60   0.33   65 12.2
complaints- 30 -0.57 -0.58 -0.74  -0.74   50 13.3
privileges  30  0.66  0.65  0.54   0.42   53 12.2
learning    30  0.80  0.79  0.78   0.64   56 11.7
raises      30  0.82  0.83  0.85   0.69   65 10.4
critical    30  0.50  0.53  0.35   0.27   75  9.9
advance     30  0.74  0.75  0.71   0.58   43 10.3

Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if item 2 is dropped, then the reliability is improved substantially. This suggests that item 2 was incorrectly scored. Doing the analysis again with item 2 positively scored produces much more favorable results.

> keys <- c(1,1,1,1,1,1,1)
> alpha(attitude,keys)


Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
  0.76  0.84  0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n rawr std.r rcor r.drop mean   sd
rating     30 0.78  0.76 0.75   0.67   65 12.2
complaints 30 0.84  0.81 0.82   0.74   67 13.3
privileges 30 0.70  0.68 0.60   0.56   53 12.2
learning   30 0.81  0.80 0.78   0.71   56 11.7
raises     30 0.85  0.86 0.85   0.79   65 10.4
critical   30 0.42  0.45 0.31   0.27   75  9.9
advance    30 0.60  0.62 0.56   0.46   43 10.3

It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.

> items <- sim.congeneric(N=500,short=FALSE,low=-2,high=2,categorical=TRUE)  #500 responses to 4 discrete items
> alpha(items$observed)  #item response analysis of congeneric measures

Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
  0.68  0.72  0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n rawr std.r rcor r.drop  mean   sd
V1 500 0.81  0.81 0.74   0.62 0.058 0.97
V2 500 0.74  0.75 0.63   0.52 0.012 0.98
V3 500 0.72  0.72 0.57   0.48 0.056 0.99
V4 500 0.68  0.67 0.48   0.41 0.098 1.02


Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0

6.2 Using omega to find the reliability of a single scale

Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 15) or omegaSem for a confirmatory analysis using the sem package (Fox et al., 2013) based upon the exploratory solution from omega.

McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009) http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf .)

One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.

ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal). For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively, and some negatively. Although probably better to specify which directions the items are to be scored by specifying a key vector, if flip = TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)

β, an alternative to ω, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).

The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded regardless of the method chosen to estimate ωh is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).

Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.

There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be the √rab, where rab is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. 2007. To do this in omega, add the option="first" or option="second" to the call.

Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.

In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.

McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness u^2 from factor analysis to find e_j^2. This is based on a decomposition of the variance of a test score, V_x, into four parts: that due to a general factor, g, that due to a set of group factors, f (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)

Letting x = cg + Af + Ds + e, then the communality of item j, based upon general as well as group factors, h_j^2 = c_j^2 + \sum f_{ij}^2, and the unique variance for the item, u_j^2 = \sigma_j^2 (1 - h_j^2), may be used to estimate the test reliability. That is, if h_j^2 is the communality of item j, based upon general as well as group factors, then for standardized items, e_j^2 = 1 - h_j^2, and

\omega_t = \frac{\mathbf{1}cc'\mathbf{1} + \mathbf{1}AA'\mathbf{1}'}{V_x} = 1 - \frac{\sum(1 - h_j^2)}{V_x} = 1 - \frac{\sum u^2}{V_x}


Because h_j^2 \ge r_{smc}^2, \omega_t \ge \lambda_6.

It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor:

\omega_h = \frac{\mathbf{1}cc'\mathbf{1}}{V_x}

Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

\omega_{\inf} = \frac{\mathbf{1}cc'\mathbf{1}}{\mathbf{1}cc'\mathbf{1} + \mathbf{1}AA'\mathbf{1}'}

In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.

> om9

9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g   F1*   F2*   F3*   h2   u2   p2
V1 0.72  0.45             0.72 0.28 0.71
V2 0.58  0.37             0.47 0.53 0.71
V3 0.57  0.41             0.49 0.51 0.66
V4 0.57        0.41       0.50 0.50 0.66
V5 0.55        0.32       0.41 0.59 0.73
V6 0.43        0.27       0.28 0.72 0.66
V7 0.47              0.47 0.44 0.56 0.50
V8 0.43              0.42 0.36 0.64 0.51
V9 0.32              0.23 0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76   max/min =  1.46
mean percent general = 0.64   with sd = 0.08 and cv of 0.13
Explained Common Variance of the general factor = 0.65

The degrees of freedom are 12 and the fit is 0.03
The number of observations was 500 with Chi Square = 15.86 with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03


> om9 <- omega(r9,title="9 simulated variables")

[Figure 15 image: bifactor diagram titled 9 simulated variables; each of V1 to V9 loads on the general factor g (0.3 to 0.7) and on one of the group factors F1, F2, or F3 (0.2 to 0.5).]

Figure 15: A bifactor solution for 9 simulated variables with a hierarchical structure.


RMSEA index = 0.026 and the 90 % confidence intervals are NA 0.055
BIC = -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27 and the fit is 0.32
The number of observations was 500 with Chi Square = 159.82 with prob < 8.3e-21
The root mean square of the residuals is 0.08
The df corrected root mean square of the residuals is 0.09

RMSEA index = 0.1 and the 90 % confidence intervals are 0.085 0.114
BIC = -7.97

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                g  F1*  F2*  F3*
Omega total for total scores and subscales   0.83 0.79 0.56 0.65
Omega general for total scores and subscales 0.69 0.55 0.30 0.46
Omega group for total scores and subscales   0.12 0.24 0.26 0.19

6.3 Estimating ωh using Confirmatory Factor Analysis

The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.

> omegaSem(r9,n.obs=500)

Call: omegaSem(m = r9, n.obs = 500)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g   F1*   F2*   F3*   h2   u2   p2
V1 0.72  0.45             0.72 0.28 0.71
V2 0.58  0.37             0.47 0.53 0.71
V3 0.57  0.41             0.49 0.51 0.66
V4 0.57        0.41       0.50 0.50 0.66
V5 0.55        0.32       0.41 0.59 0.73
V6 0.43        0.27       0.28 0.72 0.66
V7 0.47              0.47 0.44 0.56 0.50
V8 0.43              0.42 0.36 0.64 0.51
V9 0.32              0.23 0.16 0.84 0.63


With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76   max/min =  1.46
mean percent general = 0.64   with sd = 0.08 and cv of 0.13
Explained Common Variance of the general factor = 0.65

The degrees of freedom are 12 and the fit is 0.03
The number of observations was 500 with Chi Square = 15.86 with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03
RMSEA index = 0.026 and the 90 % confidence intervals are NA 0.055
BIC = -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27 and the fit is 0.32
The number of observations was 500 with Chi Square = 159.82 with prob < 8.3e-21
The root mean square of the residuals is 0.08
The df corrected root mean square of the residuals is 0.09

RMSEA index = 0.1 and the 90 % confidence intervals are 0.085 0.114
BIC = -7.97

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                g  F1*  F2*  F3*
Omega total for total scores and subscales   0.83 0.79 0.56 0.65
Omega general for total scores and subscales 0.69 0.55 0.30 0.46
Omega group for total scores and subscales   0.12 0.24 0.26 0.19

Omega Hierarchical from a confirmatory model using sem = 0.72
Omega Total from a confirmatory model using sem = 0.83
With loadings of
      g   F1*   F2*   F3*   h2   u2
V1 0.74  0.40             0.71 0.29
V2 0.58  0.36             0.47 0.53
V3 0.56  0.42             0.50 0.50
V4 0.57        0.45       0.53 0.47
V5 0.58        0.25       0.40 0.60
V6 0.43        0.26       0.26 0.74
V7 0.49              0.38 0.38 0.62
V8 0.44              0.45 0.39 0.61
V9 0.34              0.24 0.17 0.83

With eigenvalues of:
   g  F1*  F2*  F3*
2.60 0.47 0.40 0.33


6.3.1 Other estimates of reliability

Other estimates of reliability are found by the splitHalf function. These are described in more detail in Revelle and Zinbarg (2009). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.

> splitHalf(r9)

Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) = 0.84
Guttman lambda 6                          = 0.8
Average split half reliability            = 0.79
Guttman lambda 3 (alpha)                  = 0.8
Minimum split half reliability (beta)     = 0.72

6.4 Reliability and correlations of multiple scales within an inventory

A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness and Openness). The data may either be the raw data or a correlation matrix (scoreItems), or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6*.

6.4.1 Scoring from raw data

To score these five scales from the 25 items, use the scoreItems function with the helper function make.keys. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), and -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.

There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.

The printed output includes coefficients α and G6, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.

First, create a keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copying it into R.) Although not normally necessary, show the keys to understand what is happening.

Note that the number of items to specify in the make.keys function is the total number of items in the inventory. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.

> keys <- make.keys(nvars=28,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+    Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+    Openness = c(21,-22,23,24,-25)),
+    item.labels=colnames(bfi))
> keys

          Agree Conscientious Extraversion Neuroticism Openness
A1           -1             0            0           0        0
A2            1             0            0           0        0
A3            1             0            0           0        0
A4            1             0            0           0        0
A5            1             0            0           0        0
C1            0             1            0           0        0
C2            0             1            0           0        0
C3            0             1            0           0        0
C4            0            -1            0           0        0
C5            0            -1            0           0        0
E1            0             0           -1           0        0
E2            0             0           -1           0        0
E3            0             0            1           0        0
E4            0             0            1           0        0
E5            0             0            1           0        0
N1            0             0            0           1        0
N2            0             0            0           1        0
N3            0             0            0           1        0
N4            0             0            0           1        0
N5            0             0            0           1        0
O1            0             0            0           0        1
O2            0             0            0           0       -1
O3            0             0            0           0        1
O4            0             0            0           0        1
O5            0             0            0           0       -1
gender        0             0            0           0        0
education     0             0            0           0        0


age           0             0            0           0        0

The use of multiple key matrices for different inventories is facilitated by using the superMatrix function to combine two or more matrices. This allows convenient scoring of large data sets combining multiple inventories with keys based upon each individual inventory. Pretend for the moment that the big 5 items were made up of two inventories, one consisting of the first 10 items, the second the last 18 items (15 personality items + 3 demographic items). Then the following code would work:

> keys.1 <- make.keys(10,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10)))
> keys.2 <- make.keys(15,list(Extraversion=c(-1,-2,3:5),Neuroticism=c(6:10),
+    Openness = c(11,-12,13,14,-15)))
> keys.25 <- superMatrix(list(keys.1,keys.2))

The resulting keys matrix is identical to that found above, except that it does not include the extra 3 demographic items. This is useful when scoring the raw items, because the response frequencies for each category are reported, and for the demographic data.

This use of making multiple key matrices and then combining them into one super matrix of keys is particularly useful when combining demographic information with items to be scored. A set of demographic keys can be made and then these can be combined with the keys for the particular scales.
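A minimal sketch (the names demo.keys and all.keys are hypothetical):

> demo.keys <- make.keys(3,list(gender=1,education=2,age=3))  #a single item "scale" for each demographic variable
> all.keys  <- superMatrix(list(keys.25,demo.keys))           #one super matrix of keys for all 28 variables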

Now use these keys in combination with the raw data to score the items, calculate basic reliability and intercorrelations, and find the item-by-scale correlations for each item and each scale. By default, missing data are replaced by the median for that variable.

> scores <- scoreItems(keys,bfi)
> scores

Call: scoreItems(keys = keys, items = bfi)

(Unstandardized) Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
alpha   0.7          0.72         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
    Agree Conscientious Extraversion Neuroticism Openness
ASE 0.014         0.014        0.013       0.011    0.017

Average item correlation:
          Agree Conscientious Extraversion Neuroticism Openness
average.r  0.32          0.34         0.39        0.46     0.23

Guttman 6* reliability:
         Agree Conscientious Extraversion Neuroticism Openness
Lambda.6   0.7          0.72         0.76        0.81      0.6

Signal/Noise based upon av.r:
             Agree Conscientious Extraversion Neuroticism Openness
Signal/Noise   2.3           2.6          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, alpha on the diagonal
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.70          0.36         0.63      -0.245     0.23
Conscientious  0.26          0.72         0.35      -0.305     0.30
Extraversion   0.46          0.26         0.76      -0.284     0.32
Neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
Openness       0.15          0.19         0.22      -0.086     0.60

In order to see the item by scale loadings and the frequency counts of the data, print with the short option = FALSE.

To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores element of the scores object (see Figure 16).
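For example (a minimal sketch using the scores object found above):

> print(scores, short = FALSE)   #adds the item by scale correlations and response frequencies
> scores$corrected               #the attenuation corrected scale intercorrelations, by name
> pairs.panels(scores$scores)    #plot the scale scores (compare with Figure 16)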

6.4.2 Forming scales from a correlation matrix

There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.

Consider the same bfi data set, but first find the correlations and then use cluster.cor:

> r.bfi <- cor(bfi, use="pairwise")
> scales <- cluster.cor(keys, r.bfi)
> summary(scales)

Call: cluster.cor(keys = keys, r.mat = r.bfi)

Scale intercorrelations corrected for attenuation
raw correlations below the diagonal, (standardized) alpha on the diagonal
corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.71          0.35         0.64      -0.242     0.25
Conscientious  0.25          0.73         0.36      -0.287     0.30
Extraversion   0.47          0.27         0.76      -0.278     0.35
Neuroticism   -0.18         -0.22        -0.22       0.815    -0.11
Openness       0.16          0.20         0.24      -0.074     0.61

To find the correlations of the items with each of the scales (the "structure" matrix) or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.
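A minimal sketch (the loadings and pattern elements named here are assumptions about the returned object):

> cl <- cluster.loadings(keys, r.bfi)
> round(cl$loadings, 2)   #the structure matrix: item by scale correlations
> round(cl$pattern, 2)    #the pattern matrix: correlations controlling for the other scales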


> png('scores.png')
> pairs.panels(scores$scores, pch='.', jiggle=TRUE)
> dev.off()
quartz
     2

Figure 16: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.


6.5 Scoring Multiple Choice Items

Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).

> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys, iqitems)

Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

item statistics
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39

> #just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys, iqitems, score=FALSE)
> describe(iq.tf)   #compare to previous results

          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01

Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
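A minimal sketch of doing so (the four content groupings are an assumption made here for illustration, not a published key):

> iq.keys.list <- list(reasoning = 1:4, letters = 5:8, matrices = 9:12, rotation = 13:16)
> iq.scale.keys <- make.keys(16, iq.keys.list, item.labels = colnames(iq.tf))
> iq.scales <- scoreItems(iq.scale.keys, iq.tf)   #alpha, G6, and scale intercorrelations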

6.6 Item analysis

Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or for many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or based upon correlation matrices using the setCor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.
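A condensed sketch of this sequence, using the bfi items and the keys from above:

> describe(bfi[1:25])                      #item level descriptive statistics
> fa.parallel(bfi[1:25])                   #compare real and random eigen values
> vss(bfi[1:25])                           #the Very Simple Structure criterion
> f5 <- fa(bfi[1:25], 5)                   #a five factor solution
> a1 <- alpha(bfi[1:5], check.keys=TRUE)   #item whole statistics for one scale
> scores <- scoreItems(keys, bfi)          #many scales scored simultaneously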

Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 17).

7 Item Response Theory analysis

The use of Item Response Theory is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 18).


> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys, iqitems, score=TRUE, short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))   #set this to see the output for multiple items
> irt.responses(scores$scores, iqitems[1:4], breaks=11)

[Figure 17 graphic: four panels (reason.4, reason.16, reason.17, reason.19), each plotting P(response) for alternatives 0-6 against theta.]

Figure 17: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.


7.1 Factor analysis and Item Response Theory

If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δj, and item discrimination, αj, may be found from factor analysis of the tetrachoric correlations, where λj is just the factor loading on the first factor and τj is the normal threshold reported by the tetrachoric function:

δj = Dτj / √(1 − λj²)        αj = λj / √(1 − λj²)        (2)

where D is a scaling factor used when converting to the parameterization of the logistic model, and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λj) and item thresholds (τj) are just

λj = αj / √(1 + αj²)        τj = δj / √(1 + αj²)
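As a quick check of equation 2, a hand computation for a single item (the loading and threshold values are made up for illustration):

> lambda <- 0.7   #a factor loading from the tetrachoric based factor analysis
> tau <- 0.5      #the threshold reported by the tetrachoric function
> lambda/sqrt(1 - lambda^2)   #alpha, the discrimination: 0.98
> tau/sqrt(1 - lambda^2)      #delta, the difficulty in the normal metric (D = 1): 0.70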

Consider 9 dichotomous items representing one factor but differing in their levels of difficulty:

> set.seed(17)
> d9 <- sim.irt(9, 1000, -2.5, 2.5, mod="normal")   #dichotomous items
> test <- irt.fa(d9$items)

> test

Item Response Analysis using Factor Analysis

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis

Summary information by factor and item
Factor = 1
              -3    -2    -1     0     1     2     3
V1          0.48  0.59  0.24  0.06  0.01  0.00  0.00
V2          0.30  0.68  0.45  0.12  0.02  0.00  0.00
V3          0.12  0.50  0.74  0.29  0.06  0.01  0.00
V4          0.05  0.26  0.74  0.55  0.14  0.03  0.00
V5          0.01  0.07  0.44  1.05  0.41  0.06  0.01
V6          0.00  0.03  0.15  0.60  0.74  0.24  0.04
V7          0.00  0.01  0.04  0.22  0.73  0.63  0.16
V8          0.00  0.00  0.02  0.12  0.45  0.69  0.31
V9          0.00  0.01  0.02  0.08  0.25  0.47  0.36
Test Info   0.98  2.14  2.85  3.09  2.81  2.14  0.89
SEM         1.01  0.68  0.59  0.57  0.60  0.68  1.06
Reliability -0.02 0.53  0.65  0.68  0.64  0.53 -0.12


Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27 and the objective function was 1.2
The number of observations was 1000 with Chi Square = 1195.58 with prob < 7.3e-235

The root mean square of the residuals (RMSA) is 0.09
The df corrected root mean square of the residuals is 0.1

Tucker Lewis Index of factoring reliability = 0.699
RMSEA index = 0.209 and the 90% confidence intervals are 0.198 0.218
BIC = 1009.07

Similar analyses can be done for polytomous items, such as those of the bfi extraversion scale:

> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis

Summary information by factor and item
Factor = 1
              -3    -2    -1     0     1     2     3
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5 and the objective function was 0.05
The number of observations was 2800 with Chi Square = 135.92 with prob < 1.3e-27

The root mean square of the residuals (RMSA) is 0.04
The df corrected root mean square of the residuals is 0.06

Tucker Lewis Index of factoring reliability = 0.932
RMSEA index = 0.097 and the 90% confidence intervals are 0.083 0.111
BIC = 96.23

The item information functions show that not all of the items are equally good (Figure 19):

These procedures can be generalized to more than one factor by specifying the number of factors in irt.fa.


> op <- par(mfrow=c(3,1))
> plot(test, type="ICC")
> plot(test, type="IIC")
> plot(test, type="test")
> op <- par(mfrow=c(1,1))

[Figure 18 graphic: three panels plotted against the latent trait (normal scale) for items V1-V9: "Item parameters from factor analysis" (probability of response), "Item information from factor analysis", and "Test information -- item parameters from factor analysis" (with reliability on the right axis).]

Figure 18: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.


> e.info <- plot(e.irt, type="IIC")

[Figure 19 graphic: item information curves for -E1, -E2, E3, E4, and E5 plotted against the latent trait (normal scale).]

Figure 19: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.


The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.

> print(e.info, sort=TRUE)

Item Response Analysis using Factor Analysis

Summary information by factor and item
Factor = 1
              -3    -2    -1     0     1     2     3
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02

More extensive IRT packages include ltm and eRm; these should be used for serious Item Response Theory analysis.

7.2 Speeding up analyses

Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.

For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 21) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items.
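The pattern, in brief (iq.tf as found in section 6.5; the rho element of the irt.fa output holds the tetrachoric correlations):

> iq.irt <- irt.fa(iq.tf)      #slow: finds the tetrachoric correlations first
> om <- omega(iq.irt$rho, 4)   #fast: reuses the saved correlation matrix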

> iq.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = iq.tf)
Item Response Analysis using Factor Analysis

Summary information by factor and item
Factor = 1
             -3   -2   -1    0    1    2    3
reason.4   0.04 0.20 0.59 0.59 0.19 0.04 0.01
reason.16  0.07 0.23 0.46 0.39 0.16 0.05 0.01
reason.17  0.06 0.27 0.69 0.50 0.14 0.03 0.01
reason.19  0.05 0.17 0.41 0.46 0.22 0.07 0.02


> iq.irt <- irt.fa(iq.tf)

[Figure 20 graphic: item information curves for the 16 ability items (reason.4 through rotate.8) plotted against the latent trait (normal scale).]

Figure 20: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.


letter.7    0.04 0.16 0.44 0.52 0.23 0.06 0.02
letter.33   0.04 0.14 0.34 0.43 0.24 0.08 0.02
letter.34   0.04 0.17 0.48 0.54 0.22 0.06 0.01
letter.58   0.02 0.08 0.28 0.55 0.39 0.13 0.03
matrix.45   0.05 0.11 0.23 0.29 0.21 0.10 0.04
matrix.46   0.05 0.12 0.25 0.30 0.21 0.10 0.04
matrix.47   0.05 0.16 0.37 0.41 0.21 0.07 0.02
matrix.55   0.04 0.08 0.14 0.20 0.19 0.13 0.06
rotate.3    0.00 0.01 0.07 0.29 0.66 0.45 0.13
rotate.4    0.00 0.01 0.05 0.30 0.91 0.55 0.11
rotate.6    0.01 0.03 0.14 0.49 0.68 0.27 0.06
rotate.8    0.00 0.02 0.08 0.27 0.54 0.39 0.13
Test Info   0.55 1.95 4.99 6.53 5.41 2.57 0.71
SEM         1.34 0.72 0.45 0.39 0.43 0.62 1.18
Reliability -0.80 0.49 0.80 0.85 0.82 0.61 -0.40

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104 and the objective function was 1.85
The number of observations was 1525 with Chi Square = 2801.05 with prob < 0

The root mean square of the residuals (RMSA) is 0.08
The df corrected root mean square of the residuals is 0.09

Tucker Lewis Index of factoring reliability = 0.742
RMSEA index = 0.131 and the 90% confidence intervals are 0.126 0.135
BIC = 2038.76

7.3 IRT based scoring

The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for score.irt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items per person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The score.irt function applies basic IRT to this problem.

Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects:

> v9 <- sim.irt(9, 1000, -2., 2., mod="normal")   #dichotomous items
> items <- v9$items
> test <- irt.fa(items)


> om <- omega(iq.irt$rho, 4)

[Figure 21 graphic: an Omega path diagram with a general factor g and four group factors F1-F4, showing loadings for the 16 ability items (rotate, letter, reason, and matrix items).]

Figure 21: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when examining the item response functions.


> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> #now delete some of the data - note that they are ordered by score
> items[1:333, 5:9] <- NA
> items[334:666, 3:7] <- NA
> items[667:1000, 1:4] <- NA
> scores <- score.irt(test, items)
> unitweighted <- score.irt(items=items, keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord], scores, unitweighted)
> colnames(scores.df) <- c("True theta", "irt theta", "total", "fit", "rasch", "total", "fit")

These results are seen in Figure 22.

8 Multilevel modeling

Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (over all descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.

8.1 Decomposing data into within and between level correlations using statsBy

There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.

This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:

rxy = ηxwg ∗ ηywg ∗ rxywg + ηxbg ∗ ηybg ∗ rxybg        (3)


Figure 22: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.
> pairs.panels(scores.df, pch='.', gap=0)
> title('Comparing true theta for IRT, Rasch and classically based scoring', line=3)

[Figure 22 graphic: a SPLOM of True theta, irt theta, total, fit, rasch, total, and fit, titled "Comparing true theta for IRT, Rasch and classically based scoring".]


where rxy is the normal correlation, which may be decomposed into a within group and a between group correlation, rxywg and rxybg, and η (eta) is the correlation of the data with the within group values or the group means.

8.2 Generating and displaying multilevel data

withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5, and V8, and V3, V6, and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6, and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5, and V6, and V7, V8, and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.

sim.multilevel will generate simulated data with a multilevel structure.

The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.

Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)) or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
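A minimal sketch of both examples, spelling out the grouping variables:

> sb <- statsBy(sat.act, "education")   #ability and demographic data grouped by education
> sb$rwg                                #the pooled within group correlations
> sb$rbg                                #the between group correlations
> ab <- statsBy(affect, "Film")         #affect data within and between film conditions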

9 Set Correlation and Multiple Regression from the correlation matrix

An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is

R² = 1 − ∏ᵢ₌₁ⁿ (1 − λᵢ)


where λᵢ is the ith eigen value of the eigen value decomposition of the matrix

R = Rxx⁻¹ Rxy Ryy⁻¹ Rxy′

Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic based upon the average canonical correlation might be more appropriate.

setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.

Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example, the analysis is done on the correlation matrix rather than the raw data.

> C <- cov(sat.act, use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)

Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared: 0.0272, Adjusted R-squared: 0.02301
F-statistic: 6.487 on 3 and 696 DF, p-value: 0.0002476

Compare this with the output from setCor:

> #compare with mat.regress
> setCor(c(4:6), c(1:3), C, n.obs=700)


Call: setCor(y = y, x = x, data = data, z = z, n.obs = n.obs, use = use,
    std = std, square = square, main = main, plot = plot)

Multiple Regression from matrix input

Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19

multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11

Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05

degrees of freedom of regression
[1]   3 696

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.050 0.033 0.008

Chisq of canonical correlations
[1] 35.8 23.1  5.6

Average squared canonical correlation = 0.03
Cohen's Set Correlation R2 = 0.09
Shrunken Set Correlation R2 = 0.08
F and df of Cohen's Set Correlation: 7.26 9 1681.86
Unweighted correlation between the two sets = 0.01

Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R² is the same independent of the direction of the relationship.

10 Simulation functions

It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.

A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex and sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item (a more general item simulation), sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated Rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural (a general simulation of structural models), sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.

sim The default version is to generate a four factor simplex structure over three occasions,
although more general models are possible.

sim.simple Create major and minor factors. The default is for 12 variables with 3 major
factors and 6 minor factors.

sim.structure To combine a measurement and structural model into one data matrix.
Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory.
This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also
do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational
structures.

Some of these functions are described in more detail in the companion vignette, psych for sem.

The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.

Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and rⁿ for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.
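A brief sketch: with r = .8 between adjacent variables, the population correlations should fall off as .8, .64, .51 with increasing distance:

> simp <- sim.simplex(nvar = 8, alpha = 0.8)   #the population correlation matrix
> round(simp[1:4, 1:4], 2)                     #.8, .64, .51 off the diagonal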

Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.
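A brief sketch of that challenge (assuming, as for sim.structure, that the simulated raw data are returned in the observed element):

> set.seed(42)
> minor <- sim.minor(nvar = 12, nfact = 3, n = 500)   #3 major and 6 minor factors
> fa.parallel(minor$observed)   #does parallel analysis recover just the 3 major factors?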

Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.


The four IRT simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.

The logistic model is

P(x|θᵢ, δⱼ, γⱼ, ζⱼ) = γⱼ + (ζⱼ − γⱼ) / (1 + e^(αⱼ(δⱼ − θᵢ)))        (4)

where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), αj is item discrimination, and δj is item difficulty. For the 1 Parameter Logistic (Rasch) model, gamma=0, zeta=1, alpha=1, and item difficulty is the only free parameter to specify.

(Graphics of these may be seen in the demonstrations for the logistic function)

The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same. With the a = α parameter = 1.702 in the logistic model, the two models are practically identical.
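As a worked illustration of equation 4, a small R function (irt.4p is a name made up here, not a psych function) and one evaluation:

> irt.4p <- function(theta, delta, alpha = 1, gamma = 0, zeta = 1) {
+    gamma + (zeta - gamma)/(1 + exp(alpha * (delta - theta)))
+ }
> irt.4p(theta = 1, delta = 0)   #a Rasch item of difficulty 0: P = 0.73 at theta = 1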

11 Graphical Displays

Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as specific diagram functions, e.g. (but not shown):

f3 <- fa(Thurstone, 3)
plot(f3)
fa.diagram(f3)
c <- iclust(Thurstone)
plot(c)             #a pretty boring plot
iclust.diagram(c)   #a better diagram
c3 <- iclust(Thurstone, 3)
plot(c3)            #a more interesting plot
data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)
ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)

The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in the dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.


> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA, xlim=xlim, ylim=ylim, main="Demonstration of dia functions", axes=FALSE, xlab="", ylab="")
> ul <- dia.rect(1, 9, labels="upper left", xlim=xlim, ylim=ylim)
> ll <- dia.rect(1, 3, labels="lower left", xlim=xlim, ylim=ylim)
> lr <- dia.ellipse(9, 3, "lower right", xlim=xlim, ylim=ylim)
> ur <- dia.ellipse(9, 9, "upper right", xlim=xlim, ylim=ylim)
> ml <- dia.ellipse(3, 6, "middle left", xlim=xlim, ylim=ylim)
> mr <- dia.ellipse(7, 6, "middle right", xlim=xlim, ylim=ylim)
> bl <- dia.ellipse(1, 1, "bottom left", xlim=xlim, ylim=ylim)
> br <- dia.rect(9, 1, "bottom right", xlim=xlim, ylim=ylim)
> dia.arrow(from=lr, to=ul, labels="right to left")
> dia.arrow(from=ul, to=ur, labels="left to right")
> dia.curved.arrow(from=lr, to=ll$right, labels="right to left")
> dia.curved.arrow(to=ur, from=ul$right, labels="left to right")
> dia.curve(ll$top, ul$bottom, "double")   #for rectangles, specify where to point
> dia.curved.arrow(mr, ur, "up")           #but for ellipses, just point to it
> dia.curve(ml, mr, "across")
> dia.arrow(ur, lr, "top down")
> dia.curved.arrow(br$top, lr$bottom, "up")
> dia.curved.arrow(bl, br, "left to right")
> dia.arrow(bl, ll$bottom)
> dia.curved.arrow(ml, ll$right)
> dia.curved.arrow(mr, lr$top)


Figure 23: The basic graphic capabilities of the dia functions are shown in this figure.

12 Miscellaneous functions

A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list (a short demonstration of a few of them follows the list); look at the Index of psych for a list of all of the functions.

block.random Creates a block randomized structure for n independent variables. Useful
for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of
describe) and converting it to a LaTeX table. May be used when Sweave is not
convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also
fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when
studying mood effects over the day, which show a diurnal pattern. See also
circadian.mean, circadian.cor and circadian.linear.cor for finding circular means,
circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score.

geometric.mean also harmonic.mean find the appropriate mean for working with different
kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a data
set or output.

topBottom Same as headtail. Combines the head and tail functions to show the first and
last lines of a data set or output, but does not add ellipsis between.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector,
matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r and estimates effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial
correlations. (See also setCor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych
functions, but supplied here as a helper function when using other packages.

superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super
matrix" with A on the top left, B on the lower right, and 0s for the other two
quadrants. A useful trick when forming complex keys, or when forming example
problems.
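A few of these helpers in action (a minimal sketch):

> fisherz(0.5)                    #0.55, the Fisher z of r = .5
> geometric.mean(c(1, 2, 4, 8))   #2.83
> headtail(sat.act)               #the first and last lines of the data frame
> mardia(bfi[1:5])                #multivariate skew and kurtosis for 5 bfi items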

13 Data sets

A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as McDonald (1999). The original data are from Thurstone and Thurstone (1941) and reanalyzed by Bechtoldt (1961). Personality item data represent five personality factors on 25 items (bfi) or 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.

Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor
and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix
from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations
of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health
items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool
(ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment
(SAPA) web based personality assessment project. The data from 2800 subjects are
included here as a demonstration set for scale construction, factor analysis, and Item
Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative and ACT were col-
lected as part of the Synthetic Aperture Personality Assessment (SAPA) web based
personality assessment project. Age, gender, and education are also reported. The
data from 700 subjects are included here as a demonstration set for correlation and
analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5
inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used
for demonstrations of correlations, regressions, and graphic displays.

iq 14 multiple choice ability items were included as part of the Synthetic Aperture Person-
ality Assessment (SAPA) web based personality assessment project. The data from
1000 subjects are included here as a demonstration set for scoring multiple choice
inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's
data sets on the relationship between mid parent and child height and the similarity of
parent generation peas with child peas. galton is the data set for the Galton heights;
peas is the data set Francis Galton used to introduce the correlation coefficient, with
an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that
finds loadings on factors from an original data set for additional (extended) variables.
This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may
be used for demonstrating multiple dimensional scaling. vegetables is a classic
data set for demonstrating Thurstonian scaling and is the preference matrix of 9
vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and
Nunnally and Bernstein (1994), this data set allows for examples of basic scaling
techniques.

14 Development version and a user's guide

The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered, ones). To download that version, go to the repository http://personality-project.org/r/src/contrib/ and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g. ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.

News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news:

> news(Version > "1.2.8", package="psych")


15 Psychometric Theory

The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book), An introduction to Psychometric Theory with Applications in R (Revelle, in prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular and R in general, consult http://personality-project.org/r/r.guide.html, A short guide to R.

16 SessionInfo

This document was prepared using the following settings:

> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] psych_1.6.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4           lattice_0.20-33  matrixcalc_1.0-3  MASS_7.3-45   grid_3.3.0   arm_1.8-6
 [7] nlme_3.1-127          stats4_3.3.0     coda_0.18-1       minqa_1.2.4   mi_1.0       nloptr_1.0.4
[13] Matrix_1.2-6          boot_1.3-18      splines_3.3.0     lme4_1.1-12   sem_3.1-7    tools_3.3.0
[19] abind_1.4-3           GPArotation_2014.11-1  parallel_3.3.0  mnormt_1.5-4


References

Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405–432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439–458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447–473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel Modeling in R (2.3): A Brief Introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245–276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd ed. edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78–98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297–334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173–178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England.

Fox, J., Nie, Z., and Byrnes, J. (2013). sem: Structural Equation Models. R package version 3.1-3.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430–450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255–282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121–132.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41–54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283–300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4):537–549.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231–258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309–317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153–175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676–1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481–495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill, New York, 3rd edition.

Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57–74.

Revelle, W. (2017). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. http://cran.r-project.org/web/packages/psych. R package version 1.7.3.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39–73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403–414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Individual differences in cognition: New methods for examining the personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27–49. Springer, New York, NY.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega and the glb: comments on Sijtsma. Psychometrika, 74(1):145–154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83–90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345–353.

Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components–an alternative to "mathematical factors". Psychological Review, 42(5):425–454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321–327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123–133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121–144.

  • Overview of this and related documents
    • Jump starting the psych packagendasha guide for the impatient
      • Overview of this and related documents
      • Getting started
      • Basic data analysis
        • Getting the data by using readfile
        • Data input from the clipboard
        • Basic descriptive statistics
        • Simple descriptive graphics
          • Scatter Plot Matrices
          • Correlational structure
          • Heatmap displays of correlational structure
            • Polychoric tetrachoric polyserial and biserial correlations
              • Item and scale analysis
                • Dimension reduction through factor analysis and cluster analysis
                  • Minimum Residual Factor Analysis
                  • Principal Axis Factor Analysis
                  • Weighted Least Squares Factor Analysis
                  • Principal Components analysis (PCA)
                  • Hierarchical and bi-factor solutions
                  • Item Cluster Analysis iclust
                    • Confidence intervals using bootstrapping techniques
                    • Comparing factorcomponentcluster solutions
                    • Determining the number of dimensions to extract
                      • Very Simple Structure
                      • Parallel Analysis
                        • Factor extension
                          • Classical Test Theory and Reliability
                            • Reliability of a single scale
                            • Using omega to find the reliability of a single scale
                            • Estimating h using Confirmatory Factor Analysis
                              • Other estimates of reliability
                                • Reliability and correlations of multiple scales within an inventory
                                  • Scoring from raw data
                                  • Forming scales from a correlation matrix
                                    • Scoring Multiple Choice Items
                                    • Item analysis
                                      • Item Response Theory analysis
                                        • Factor analysis and Item Response Theory
                                        • Speeding up analyses
                                        • IRT based scoring
                                          • Multilevel modeling
                                            • Decomposing data into within and between level correlations using statsBy
                                            • Generating and displaying multilevel data
                                              • Set Correlation and Multiple Regression from the correlation matrix
                                              • Simulation functions
                                              • Graphical Displays
                                              • Miscellaneous functions
                                              • Data sets
                                              • Development version and a users guide
                                              • Psychometric Theory
                                              • SessionInfo
Page 13: How To: Use the psych package for Factor Analysis and data

gt png(corplotpng)gt corplot(Thurstonenumbers=TRUEmain=9 cognitive variables from Thurstone)gt devoff()null device

1

Figure 2 The structure of correlation matrix can be seen more clearly if the variables aregrouped by factor and then the correlations are shown by color By using the rsquonumbersrsquooption the values are displayed as well

13

gt png(circplotpng)gt circ lt- simcirc(24)gt rcirc lt- cor(circ)gt corplot(rcircmain=24 variables in a circumplex)gt devoff()null device

1

Figure 3 Using the corplot function to show the correlations in a circumplex Correlationsare highest near the diagonal diminish to zero further from the diagonal and the increaseagain towards the corners of the matrix Circumplex structures are common in the studyof affect

14

45 Polychoric tetrachoric polyserial and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient If thedata eg ability items are thought to represent an underlying continuous although latentvariable the φ will underestimate the value of the Pearson applied to these latent variablesOne solution to this problem is to use the tetrachoric correlation which is based uponthe assumption of a bivariate normal distribution that has been cut at certain points Thedrawtetra function demonstrates the process A simple generalization of this to the caseof the multiple cuts is the polychoric correlation

Other estimated correlations based upon the assumption of bivariate normality with cutpoints include the biserial and polyserial correlation

If the data are a mix of continuous polytomous and dichotomous variables the mixedcor

function will calculate the appropriate mixture of Pearson polychoric tetrachoric biserialand polyserial correlations

The correlation matrix resulting from a number of tetrachoric or polychoric correlationmatrix sometimes will not be positive semi-definite This will also happen if the correlationmatrix is formed by using pair-wise deletion of cases The corsmooth function will adjustthe smallest eigen values of the correlation matrix to make them positive rescale all ofthem to sum to the number of variables and produce a ldquosmoothedrdquo correlation matrix Anexample of this problem is a data set of burt which probably had a typo in the originalcorrelation matrix Smoothing the matrix corrects this problem

5 Item and scale analysis

The main functions in the psych package are for analyzing the structure of items and ofscales and for finding various estimates of scale reliability These may be considered asproblems of dimension reduction (eg factor analysis cluster analysis principal compo-nents analysis) and of forming and estimating the reliability of the resulting compositescales

51 Dimension reduction through factor analysis and cluster analysis

Parsimony of description has been a goal of science since at least the famous dictumcommonly attributed to William of Ockham to not multiply entities beyond necessity1 Thegoal for parsimony is seen in psychometrics as an attempt either to describe (components)

1 Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.

or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs2. At the most simple, a set of items can be thought to represent a random sample from one underlying domain or perhaps a small set of domains. The question for the psychometrician is how many domains are represented and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.

At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores, or to estimate (for factors) factor scores.
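As a sketch of this equivalence, unrotated component loadings can be recovered directly from the SVD of the standardized data matrix (the use of the built-in attitude data here is illustrative; dividing the singular values by sqrt(N-1) puts them in eigen value units):

> X <- scale(attitude)  # standardize the raw data
> sv <- svd(X)
> loadings.svd <- sv$v %*% diag(sv$d/sqrt(nrow(X)-1))  # component loadings from the SVD
> loadings.pca <- principal(attitude,nfactors=7,rotate="none")$loadings
> round(abs(loadings.svd) - abs(loadings.pca[,1:7]),2)  # agree up to reflection (sign) of each component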

Several of the functions in psych address the problem of data reduction:

fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis, and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will eventually be phased out.

principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values.

factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the square root of the product of the sums of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.

vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ2 estimate.

2 Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.

fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.

fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

nfactors A number of different tests for the number of factors problem are run.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm, specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.

5.1.1 Minimum Residual Factor Analysis

The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix $_nR_n$ be approximated by the product of a factor matrix $_nF_k$ and its transpose plus a diagonal matrix of uniquenesses?

$$R = FF' + U^2 \tag{1}$$

The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls", and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.

factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.

A classic data set, collected by Thurstone and Thurstone (1941), then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 1). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ", and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation, although prior versions defaulted to varimax. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).
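Trying an alternative transformation only requires changing the rotate argument; a minimal sketch (the choice of varimax and geominQ here is illustrative):

> f3v <- fa(Thurstone,3,n.obs=213,rotate="varimax")  # an orthogonal alternative
> f3g <- fa(Thurstone,3,n.obs=213,rotate="geominQ")  # an oblique alternative
> factor.congruence(f3v,f3g)  # how similar are the two solutions?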

5.1.2 Principal Axis Factor Analysis

An alternative least squares algorithm, factor.pa (incorporated into fa as an option, fm="pa"), does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter parameter to 1 produces the SAS solution.
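A sketch of the comparison (max.iter=1 mimics the single-iteration SAS convention noted above):

> f3pa  <- fa(Thurstone,3,n.obs=213,fm="pa")             # iterated principal axes
> f3pa1 <- fa(Thurstone,3,n.obs=213,fm="pa",max.iter=1)  # one iteration, as in the SAS example
> factor.congruence(f3pa,f3pa1)  # compare the two solutions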

The solutions from the fa, the factor.minres and factor.pa, as well as the principal functions, can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations include varimax and quartimax. Oblique transformations include oblimin, quartimin, and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.

Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 2).

5.1.3 Weighted Least Squares Factor Analysis

Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller overall residuals.

Table 1: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.
> f3t <- fa(Thurstone,3,n.obs=213)
> f3t
Factor Analysis using method = minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix

                  MR1   MR2   MR3   h2   u2  com
Sentences        0.91 -0.04  0.04 0.82 0.18  1.0
Vocabulary       0.89  0.06 -0.03 0.84 0.16  1.0
Sent.Completion  0.83  0.04  0.00 0.73 0.27  1.0
First.Letters    0.00  0.86  0.00 0.73 0.27  1.0
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37  1.0
Suffixes         0.18  0.63 -0.08 0.50 0.50  1.2
Letter.Series    0.03 -0.01  0.84 0.72 0.28  1.0
Pedigrees        0.37 -0.05  0.47 0.50 0.50  1.9
Letter.Group    -0.06  0.21  0.64 0.53 0.47  1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

 With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63

Table 2: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.
> f3 <- fa(Thurstone,3,n.obs = 213,fm="pa")
> f3o <- target.rot(f3)
> f3o
Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix

                  PA1   PA2   PA3   h2   u2
Sentences        0.89 -0.03  0.07 0.81 0.19
Vocabulary       0.89  0.07  0.00 0.80 0.20
Sent.Completion  0.83  0.04  0.03 0.70 0.30
First.Letters   -0.02  0.85 -0.01 0.73 0.27
4.Letter.Words  -0.05  0.74  0.09 0.57 0.43
Suffixes         0.17  0.63 -0.09 0.43 0.57
Letter.Series   -0.06 -0.08  0.84 0.69 0.31
Pedigrees        0.33 -0.09  0.48 0.37 0.63
Letter.Group    -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00

In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 3).

The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 4). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 4) or by a structure diagram (Figure 5).

> plot(f3t)

[Figure 4 image: an item-by-factor SPLOM of the MR1, MR2, and MR3 loadings (axes from 0.0 to 0.8), titled "Factor Analysis"]

Figure 4: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.

A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but

Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.
> f3w <- fa(Thurstone,3,n.obs = 213,fm="wls")
> print(f3w,cut=0,digits=3)
Factor Analysis using method = wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix

                  WLS1   WLS2   WLS3    h2    u2  com
Sentences        0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary       0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion  0.833  0.034  0.007 0.735 0.265 1.00
First.Letters   -0.002  0.855  0.003 0.731 0.269 1.00
4.Letter.Words  -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes         0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series    0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees        0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group    -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

 With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627

> fa.diagram(f3t)

[Figure 5 image: a path diagram titled "Factor Analysis", with MR1 pointing to Sentences, Vocabulary, and Sent.Completion; MR2 to First.Letters, 4.Letter.Words, and Suffixes; MR3 to Letter.Series, Letter.Group, and Pedigrees; the three correlated factors are joined by curved arrows]

Figure 5: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.

that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.

5.1.4 Principal Components analysis (PCA)

An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Psychologists typically use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 4 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. (Factor loadings and PCA loadings converge as the number of items increases: factor loadings may be thought of as the asymptotic principal component loadings as the number of items on the component grows towards infinity.) Also note how the goodness of fit statistics, based upon the residual off diagonal elements, is much worse than the fa solution.

5.1.5 Hierarchical and bi-factor solutions

For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general, factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).

Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 5) can themselves be factored. This results in a higher order factor model (Figure 6). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 7). Yet

Table 4: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 2. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.
> p3p <- principal(Thurstone,3,n.obs = 213,rotate="Promax")
> p3p
Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix

                  RC1   RC2   RC3   h2   u2  com
Sentences        0.92  0.01  0.01 0.86 0.14  1.0
Vocabulary       0.90  0.10 -0.05 0.86 0.14  1.0
Sent.Completion  0.91  0.04 -0.04 0.83 0.17  1.0
First.Letters    0.01  0.84  0.07 0.78 0.22  1.0
4.Letter.Words  -0.05  0.81  0.17 0.75 0.25  1.1
Suffixes         0.18  0.79 -0.15 0.70 0.30  1.2
Letter.Series    0.03 -0.03  0.88 0.78 0.22  1.0
Pedigrees        0.45 -0.16  0.57 0.67 0.33  2.1
Letter.Group    -0.19  0.19  0.86 0.75 0.25  1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

 With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
 with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98

another solution is to use structural equation modeling to directly solve for the general and group factors.

> om.h <- omega(Thurstone,n.obs=213,sl=FALSE)
> op <- par(mfrow=c(1,1))

[Figure 6 image: an "Omega" path diagram with group factors F1 (Sentences, Vocabulary, Sent.Completion), F2 (First.Letters, 4.Letter.Words, Suffixes), and F3 (Letter.Series, Letter.Group, Pedigrees), and a higher order factor g loading on F1, F2, and F3]

Figure 6: A higher order factor solution to the Thurstone 9 variable problem.

Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011). Unfortunately, this rotation will frequently achieve a local rather than global optimum, and it is recommended to try multiple original configurations.

> om <- omega(Thurstone,n.obs=213)

[Figure 7 image: an "Omega" bifactor diagram with the general factor g loading on all 9 variables and residualized group factors F1, F2, and F3 accounting for the remaining common variance]

Figure 7: A bifactor solution to the Thurstone 9 variable problem.

5.1.6 Item Cluster Analysis: iclust

An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) review why the iclust algorithm is particularly appropriate for scale construction in marketing.

Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust, and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):

1. Find the proximity (e.g., correlation) matrix.

2. Identify the most similar pair of items.

3. Combine this most similar pair of items to form a new variable (cluster).

4. Find the similarity of this cluster to all other items and clusters.

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains, or in iclust if there is a failure to increase reliability coefficients α or β).

6. Purify the solution by reassigning items to the most similar cluster center.

iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori, or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 8).

Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 7). The pattern matrix is available as an object in the results.

> data(bfi)
> ic <- iclust(bfi[1:25])

[Figure 8 image: the "ICLUST" dendrogram of the 25 personality items, with major clusters C20 (α = 0.81, β = 0.63), C16 (α = 0.81, β = 0.76), C15 (α = 0.73, β = 0.67), and C21 (α = 0.41, β = 0.27)]

Figure 8: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.

The previous analysis (Figure 8) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 9). Note that finding the polychoric correlations takes some time the first time, but the next three analyses were done using that correlation matrix (r.poly$rho). When using the console for input, polychoric will report on its progress while working using

Table 5: The summary statistics from an iclust analysis show three large clusters and a smaller cluster.
> summary(ic)  # show the results
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

progressBar. polychoric and tetrachoric, when running on OS X, will take advantage of the multiple cores on the machine and run somewhat faster.

Table 6: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.
> data(bfi)
> r.poly <- polychoric(bfi[1:25])  # the ... indicate the progress of the function

A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.

5.2 Confidence intervals using bootstrapping techniques

Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ2 and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 8). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.

5.3 Comparing factor/component/cluster solutions

Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of the congruence coefficient, which is just the cosine of the angle between the dimensions:

$$c_{f_i f_j} = \frac{\sum_{k=1}^{n} f_{ik} f_{jk}}{\sqrt{\sum f_{ik}^2 \sum f_{jk}^2}}$$
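Because the coefficient is just a cosine, it is easy to verify by hand; a minimal sketch with made-up loading vectors (factor.congruence does this for whole loading matrices):

> congruence <- function(f1,f2) sum(f1*f2)/sqrt(sum(f1^2)*sum(f2^2))  # cosine of the angle
> congruence(c(.9,.8,.7,0,0),c(.8,.9,.6,.1,0))  # near 1: essentially the same factor
> congruence(c(.9,.8,.7,0,0),c(0,.1,0,.8,.9))   # near 0: nearly orthogonal factors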

Consider the case of a four factor solution and a four cluster solution to the Big Five problem.

> f4 <- fa(bfi[1:25],4,fm="pa")
> factor.congruence(f4,ic)

      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12
PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90

> ic.poly <- iclust(r.poly$rho,title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)

[Figure 9 image: the "ICLUST using polychoric correlations" dendrogram; all 25 items eventually join a single cluster C23 (α = 0.83, β = 0.58), built from sub-clusters including C20, C19, C16, C15, and C21]

Figure 9: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 8), which was done using Pearson correlations.

> ic.poly <- iclust(r.poly$rho,5,title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)

[Figure 10 image: the "ICLUST using polychoric correlations for nclusters=5" dendrogram, with five clusters (C20, C19, C16, C15, and C7) and O4 not joining any cluster]

Figure 10: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 9), which was done without specifying the number of clusters, and to the next one (Figure 11), which was done by changing the beta criterion.

> ic.poly <- iclust(r.poly$rho,beta.size=3,title="ICLUST beta.size=3")

[Figure 11 image: the "ICLUST beta.size=3" dendrogram, closely resembling the full polychoric solution of Figure 9, again ending in C23 (α = 0.83, β = 0.58)]

Figure 11: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 8, 9, 10).

Table 7: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.
> print(ic,cut=.3)
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
     O   P    C20   C16   C15   C21
A1 C20 C20
A2 C20 C20  0.59
A3 C20 C20  0.65
A4 C20 C20  0.43
A5 C20 C20  0.65
C1 C15 C15              0.54
C2 C15 C15              0.62
C3 C15 C15              0.54
C4 C15 C15        0.31 -0.66
C5 C15 C15 -0.30  0.36 -0.59
E1 C20 C20 -0.50
E2 C20 C20 -0.61  0.34
E3 C20 C20  0.59             -0.39
E4 C20 C20  0.66
E5 C20 C20  0.50        0.40 -0.32
N1 C16 C16        0.76
N2 C16 C16        0.75
N3 C16 C16        0.74
N4 C16 C16 -0.34  0.62
N5 C16 C16        0.55
O1 C20 C21                   -0.53
O2 C21 C21                    0.44
O3 C20 C21  0.39             -0.62
O4 C21 C21                   -0.33
O5 C21 C21                    0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit = 0.68   Pattern fit = 0.96   RMSR = 0.05
NULL

Table 8: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.
> fa(bfi[1:10],2,n.iter=20)
Factor Analysis with confidence intervals using method =  fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method = minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix

     MR2   MR1   h2   u2  com
A1  0.07 -0.40 0.15 0.85  1.1
A2  0.02  0.65 0.44 0.56  1.0
A3 -0.03  0.77 0.57 0.43  1.0
A4  0.15  0.44 0.26 0.74  1.2
A5  0.02  0.62 0.39 0.61  1.0
C1  0.57 -0.05 0.30 0.70  1.0
C2  0.62 -0.01 0.39 0.61  1.0
C3  0.54  0.03 0.30 0.70  1.0
C4 -0.66  0.01 0.43 0.57  1.0
C5 -0.57 -0.05 0.35 0.65  1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

 With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with MLE Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.078 and the 90% confidence intervals are 0.071 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

 Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.46 -0.40 -0.36
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.06 -0.03  0.00  0.74  0.77  0.80
A4  0.10  0.15  0.18  0.41  0.44  0.48
A5 -0.02  0.02  0.07  0.58  0.62  0.66
C1  0.52  0.57  0.60 -0.08 -0.05 -0.02
C2  0.59  0.62  0.66 -0.04 -0.01  0.02
C3  0.50  0.54  0.57 -0.01  0.03  0.08
C4 -0.70 -0.66 -0.62 -0.02  0.01  0.03
C5 -0.62 -0.57 -0.53 -0.08 -0.05 -0.01

 Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.28     0.32  0.37
>


A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor, and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 9).

Table 9: Congruence coefficients for oblique factor, bifactor, and component solutions for the Thurstone problem.
> factor.congruence(list(f3t,f3o,om,p3p))

     MR1  MR2  MR3   PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2  RC3
MR1 1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
MR2 0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
MR3 0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
PA1 1.00 0.03 0.01  1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2 0.06 1.00 0.01  0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99 0.05
PA3 0.13 0.06 1.00  0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01 0.99
g   0.72 0.60 0.52  0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58 0.50
F1  1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
F2  0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
F3  0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
h2  0.74 0.57 0.51  0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56 0.49
RC1 1.00 0.04 0.06  1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06 0.00
RC2 0.08 0.99 0.02  0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00 0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05 1.00

5.4 Determining the number of dimensions to extract

How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).

Techniques most commonly used include:

1) Extracting factors until the chi square of the residual matrix is not significant

2) Extracting factors until the change in chi square from factor n to factor n+1 is notsignificant

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis; fa.parallel) (Horn, 1965)

4) Plotting the magnitude of the successive eigen values and applying the scree test: a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face (Cattell, 1966)

5) Extracting factors as long as they are interpretable

6) Using the Very Simple Structure criterion (vss) (Revelle and Rocklin, 1979)

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976)

8) Extracting principal components until the eigen value < 1

Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects, and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size, in that for large samples (N > 1,000) all of the eigen values of random factors will be very close to 1. The scree test is quite appealing but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigator's creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3, and is probably the worst of all criteria.
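Several of these criteria can be examined in one call; a minimal sketch (the choice of 8 as the maximum number of factors is arbitrary):

> nf <- nfactors(bfi[1:25],n=8)  # run VSS, MAP, and related tests for 1 to 8 factors
> nf  # print the criteria side by side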

An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix, after the factors of interest are extracted, is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest that nonetheless account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function; a brief sketch follows.
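A sketch of that later discussion, assuming (as with the other sim functions used in this vignette) that sim.minor returns the raw data in its $observed element when n > 0:

> set.seed(42)
> sx <- sim.minor(nvar=12,nfact=3,n=500)  # 3 major factors plus many minor, trivial factors
> fa.parallel(sx$observed)  # purely statistical rules may now suggest more than 3 factors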

5.4.1 Very Simple Structure

The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Absolute Partial correlation) of Velicer (1976).

Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 12). However, the MAP criterion suggests that 5 is optimal.

> vss <- vss(bfi[1:25],title="Very Simple Structure of a Big 5 inventory")

[Figure 12 image: "Very Simple Structure of a Big 5 inventory", plotting Very Simple Structure Fit (0.0 to 1.0) against the Number of Factors (1 to 8) for complexity 1 through 4 solutions]

Figure 12: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.

> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximimum of 0.58  with  4  factors
VSS complexity 2 achieves a maximimum of 0.74  with  4  factors

The Velicer MAP achieves a minimum of 0.01  with  5  factors
BIC achieves a minimum of  -524.26  with  8  factors
Sample Size adjusted BIC achieves a minimum of  -117.56  with  8  factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit RMSEA  BIC   SABIC complex eChisq  SRMR eCRMS  eBIC
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.123 9648 10522.1     1.0  23881 0.119 0.125 21698
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.100 5287  6084.5     1.2  12432 0.086 0.094 10440
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.087 3200  3924.3     1.3   7232 0.066 0.075  5422
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.074 1731  2385.1     1.4   3750 0.047 0.057  2115
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.055  281   869.3     1.6   1495 0.030 0.038    27
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.043 -296   228.5     1.7    670 0.020 0.027  -639
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.037 -463     1.2     1.9    448 0.016 0.023  -711
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.032 -524  -117.6     1.9    289 0.013 0.020  -727

5.4.2 Parallel Analysis

An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 13). It is interesting to compare fa.parallel with the paran function from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial, minor (but not zero) factors, show that using smcs will tend to lead to too many factors.

A more tedious problem, in terms of computation, is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly or fa.parallel with the cor option="poly". By default, the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
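A sketch of the polychoric case (20 replications, the default noted above; this is slow for large data sets):

> fa.parallel(bfi[1:25],cor="poly",n.iter=20)  # parallel analysis of polychoric correlations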

5.5 Factor extension

Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created

> fa.parallel(bfi[1:25],main="Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors =  6  and the number of components =  6

[Figure 13 image: "Parallel Analysis of a Big 5 inventory", plotting eigen values of principal components and factor analysis against the Factor/Component Number, with lines for PC Actual Data, PC Simulated Data, PC Resampled Data, FA Actual Data, FA Simulated Data, and FA Resampled Data]

Figure 13: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.

to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 14).

Another way to examine the overlap between two sets is the use of set correlation found by setCor (discussed later).

6 Classical Test Theory and Reliability

Surprisingly, 112 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).

α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

$$\lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(V_x)}{V_x}\right) = \frac{n}{n-1}\,\frac{V_x - tr(V_x)}{V_x} = \alpha$$
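The identity is easy to verify numerically; a minimal sketch, using psych's tr function for the trace and taking Vx to be the sum of all the elements of the covariance matrix (the use of the attitude data here is illustrative):

> lambda3 <- function(C) { n <- ncol(C)
+    n/(n-1) * (1 - tr(C)/sum(C)) }  # alpha (Guttman's lambda 3) from a covariance matrix
> lambda3(cov(attitude))  # should agree with the raw_alpha reported by alpha(attitude)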

Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. For τ equivalent items, alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy"), coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.

Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or, more precisely, the variance of the errors, $e_j^2$, and is

$$\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum (1 - r_{smc}^2)}{V_x}$$

The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.
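A companion sketch for λ6, with the smc function supplying the squared multiple correlations (for a correlation matrix, Vx is the sum of all its elements):

> lambda6 <- function(R) 1 - sum(1 - smc(R))/sum(R)  # Guttman's lambda 6 from a correlation matrix
> lambda6(cor(attitude))  # should agree with the G6(smc) reported by alpha(attitude)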

G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal, or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability

> v16 <- sim.item(16)
> s <- c(1,3,5,7,9,11,13,15)
> f2 <- fa(v16[,s],2)
> fe <- fa.extension(cor(v16)[s,-s],f2)
> fa.diagram(f2,fe=fe)

[Figure 14 image: a "Factor analysis and extension" diagram, with factors MR1 and MR2 defined by the odd-numbered variables on the left and extended to the even-numbered variables on the right]

Figure 14: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.

the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.

Alpha, G6, and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.

More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.

Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.

Functions for examining the reliability of a single scale or a set of scales include:

alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item-whole correlations, α if an item is omitted, and item means and standard deviations.

guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zergers (1978) (μ0 ... μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).

omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)

cluster.cor Given an n x c cluster definition matrix of -1s, 0s, and 1s (the keys), and an n x n correlation matrix, find the correlations of the composite clusters.

scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of item scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha coefficient, G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items. (A short scoring sketch follows this list.)

score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds the total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.
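A sketch of the multi-scale scoring workflow (the two item groupings below are illustrative selections from the bfi; make.keys builds the keys matrix):

> keys <- make.keys(bfi,list(agree=c("-A1","A2","A3","A4","A5"),
+                            neuro=c("N1","N2","N3","N4","N5")))
> scores <- scoreItems(keys,bfi)  # find scale scores, alpha, G6, and scale intercorrelations
> scores  # print the summary statistics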

6.1 Reliability of a single scale

A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, and McDonald's ωh and ωt. Consider a simulated data set, representing 9 items with a hierarchical structure and the following correlation matrix. Then, using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as the estimates when each item is dropped, one at a time.

> set.seed(17)
> r9 <- sim.hierarchical(n=500,raw=TRUE)$observed
> round(cor(r9),2)

     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00

> alpha(r9)

Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
0.77 0.8 0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97

Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that item 2 is to be scored (incorrectly) negatively.

> keys <- c(1,-1,1,1,1,1,1)
> alpha(attitude,keys)

Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
       0.43      0.52    0.75      0.14 1.1 0.13   58 5.5

 lower alpha upper     95% confidence boundaries
0.18 0.43 0.68

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r  S/N alpha se
rating           0.32      0.44    0.67     0.114 0.77    0.170
complaints-      0.80      0.80    0.82     0.394 3.89    0.057
privileges       0.27      0.41    0.72     0.103 0.69    0.176
learning         0.14      0.31    0.64     0.069 0.44    0.207
raises           0.14      0.27    0.61     0.059 0.38    0.205
critical         0.36      0.47    0.76     0.130 0.90    0.139
advance          0.21      0.34    0.66     0.079 0.51    0.171

 Item statistics
             n raw.r std.r r.cor r.drop mean   sd
rating      30  0.60  0.60  0.60   0.33   65 12.2
complaints- 30 -0.57 -0.58 -0.74  -0.74   50 13.3
privileges  30  0.66  0.65  0.54   0.42   53 12.2
learning    30  0.80  0.79  0.78   0.64   56 11.7
raises      30  0.82  0.83  0.85   0.69   65 10.4
critical    30  0.50  0.53  0.35   0.27   75  9.9
advance     30  0.74  0.75  0.71   0.58   43 10.3

Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if item 2 is dropped, then the reliability is improved substantially. This suggests that item 2 was incorrectly scored. Doing the analysis again with item 2 positively scored produces much more favorable results.

> keys <- c(1,1,1,1,1,1,1)
> alpha(attitude,keys)

Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
0.76 0.84 0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3

It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.

> items <- sim.congeneric(N=500,short=FALSE,low=-2,high=2,categorical=TRUE)  # 500 responses to 4 discrete items
> alpha(items$observed)  # item response analysis of congeneric measures

Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
0.68 0.72 0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02

Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0

6.2 Using omega to find the reliability of a single scale

Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 15) or omegaSem for a confirmatory analysis using the sem package (Fox et al., 2013), based upon the exploratory solution from omega.

McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) (http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf) compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009), http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf.)

One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.

ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal). For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively and some negatively. Although it is probably better to specify which directions the items are to be scored by specifying a key vector, if flip = TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)

β, an alternative to ω, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis, a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).

The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).

Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default, this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.

There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be $\sqrt{r_{ab}}$, where $r_{ab}$ is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. (2007). To do this in omega, add option="first" or option="second" to the call.

Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.

In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.

McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness (u²j) from factor analysis to find e²j. This is based on a decomposition of the variance of a test score, Vx, into four parts: that due to a general factor, g, that due to a set of group factors, f (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)

Letting x = cg + Af + Ds + e, then the communality of item j, based upon general as well as group factors, is h²j = c²j + Σ f²ij, and the unique variance for the item, u²j = σ²j(1 − h²j), may be used to estimate the test reliability. That is, if h²j is the communality of item j, based upon general as well as group factors, then for standardized items, e²j = 1 − h²j and

    ωt = (1cc′1 + 1AA′1′)/Vx = 1 − Σ(1 − h²j)/Vx = 1 − Σu²/Vx


Because h²j ≥ r²smc, ωt ≥ λ6.

It is important to distinguish here between the two ω coefficients of McDonald (1978) and Equation 6.20a of McDonald (1999), ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor:

    ωh = 1cc′1/Vx

Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

    ωinf = 1cc′1/(1cc′1 + 1AA′1′)

In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.

> om9

9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than  0.2
      g   F1*  F2*  F3*   h2   u2   p2
V1 0.72  0.45            0.72 0.28 0.71
V2 0.58  0.37            0.47 0.53 0.71
V3 0.57  0.41            0.49 0.51 0.66
V4 0.57       0.41       0.50 0.50 0.66
V5 0.55       0.32       0.41 0.59 0.73
V6 0.43       0.27       0.28 0.72 0.66
V7 0.47            0.47  0.44 0.56 0.50
V8 0.43            0.42  0.36 0.64 0.51
V9 0.32            0.23  0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76   max/min =  1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03


> om9 <- omega(r9, title = "9 simulated variables")

Figure 15: A bifactor solution for 9 simulated variables with a hierarchical structure. (The path diagram shows the general factor g and the group factors F1, F2, and F3, with their loadings on V1 through V9.)


RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                  g   F1*   F2*   F3*
Correlation of scores with factors             0.84  0.59  0.61  0.54
Multiple R square of scores with factors       0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates  0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

6.3 Estimating ωh using Confirmatory Factor Analysis

The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.

> omegaSem(r9, n.obs=500)

Call: omegaSem(m = r9, n.obs = 500)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than  0.2
      g   F1*  F2*  F3*   h2   u2   p2
V1 0.72  0.45            0.72 0.28 0.71
V2 0.58  0.37            0.47 0.53 0.71
V3 0.57  0.41            0.49 0.51 0.66
V4 0.57       0.41       0.50 0.50 0.66
V5 0.55       0.32       0.41 0.59 0.73
V6 0.43       0.27       0.28 0.72 0.66
V7 0.47            0.47  0.44 0.56 0.50
V8 0.43            0.42  0.36 0.64 0.51
V9 0.32            0.23  0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76   max/min =  1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03
RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                  g   F1*   F2*   F3*
Correlation of scores with factors             0.84  0.59  0.61  0.54
Multiple R square of scores with factors       0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates  0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

 Omega Hierarchical from a confirmatory model using sem =  0.72
 Omega Total from a confirmatory model using sem =  0.83
With loadings of
      g   F1*  F2*  F3*   h2   u2
V1 0.74  0.40            0.71 0.29
V2 0.58  0.36            0.47 0.53
V3 0.56  0.42            0.50 0.50
V4 0.57       0.45       0.53 0.47
V5 0.58       0.25       0.40 0.60
V6 0.43       0.26       0.26 0.74
V7 0.49            0.38  0.38 0.62
V8 0.44            0.45  0.39 0.61
V9 0.34            0.24  0.17 0.83

With eigenvalues of:
   g  F1*  F2*  F3*
2.60 0.47 0.40 0.33


6.3.1 Other estimates of reliability

Other estimates of reliability are found by the splitHalf function. These are described in more detail in Revelle and Zinbarg (2009). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.

> splitHalf(r9)

Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) =  0.84
Guttman lambda 6                          =  0.8
Average split half reliability            =  0.79
Guttman lambda 3 (alpha)                  =  0.8
Minimum split half reliability (beta)     =  0.72

6.4 Reliability and correlations of multiple scales within an inventory

A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness, and Openness). The data may either be the raw data or a correlation matrix (scoreItems), or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6*.

6.4.1 Scoring from raw data

To score these five scales from the 25 items, use the scoreItems function with the helper function make.keys. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), and -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.
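As a one-line worked example of reverse scoring (a hypothetical item scored from 1 to 6, so max + min = 7):

> item <- c(1, 2, 6)    # three raw responses
> reversed <- 7 - item  # (max + min) - item yields 6, 5, 1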

There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.

The printed output includes coefficients α and G6, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.

First, create the keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copying it into R.) Although not normally necessary, show the keys to understand what is happening.

Note that the number of items to specify in the make.keys function is the total number of items in the inventory. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.

> keys <- make.keys(nvars=28, list(Agree=c(-1,2:5), Conscientious=c(6:8,-9,-10),
+    Extraversion=c(-11,-12,13:15), Neuroticism=c(16:20),
+    Openness = c(21,-22,23,24,-25)),
+    item.labels=colnames(bfi))
> keys

          Agree Conscientious Extraversion Neuroticism Openness
A1           -1             0            0           0        0
A2            1             0            0           0        0
A3            1             0            0           0        0
A4            1             0            0           0        0
A5            1             0            0           0        0
C1            0             1            0           0        0
C2            0             1            0           0        0
C3            0             1            0           0        0
C4            0            -1            0           0        0
C5            0            -1            0           0        0
E1            0             0           -1           0        0
E2            0             0           -1           0        0
E3            0             0            1           0        0
E4            0             0            1           0        0
E5            0             0            1           0        0
N1            0             0            0           1        0
N2            0             0            0           1        0
N3            0             0            0           1        0
N4            0             0            0           1        0
N5            0             0            0           1        0
O1            0             0            0           0        1
O2            0             0            0           0       -1
O3            0             0            0           0        1
O4            0             0            0           0        1
O5            0             0            0           0       -1
gender        0             0            0           0        0
education     0             0            0           0        0
age           0             0            0           0        0

The use of multiple key matrices for different inventories is facilitated by using the superMatrix function to combine two or more matrices. This allows convenient scoring of large data sets combining multiple inventories with keys based upon each individual inventory. Pretend, for the moment, that the big 5 items were made up of two inventories, one consisting of the first 10 items, the second the last 18 items (15 personality items + 3 demographic items). Then the following code would work:

> keys.1 <- make.keys(10, list(Agree=c(-1,2:5), Conscientious=c(6:8,-9,-10)))
> keys.2 <- make.keys(15, list(Extraversion=c(-1,-2,3:5), Neuroticism=c(6:10),
+    Openness = c(11,-12,13,14,-15)))
> keys.25 <- superMatrix(list(keys.1, keys.2))

The resulting keys matrix is identical to that found above, except that it does not include the extra 3 demographic items. This is useful when scoring the raw items, because the response frequencies for each category are reported, including those for the demographic items.

This use of making multiple key matrices, and then combining them into one super matrix of keys, is particularly useful when combining demographic information with items to be scored. A set of demographic keys can be made and then these can be combined with the keys for the particular scales.

Now use these keys in combination with the raw data to score the items, calculate basic reliability and intercorrelations, and find the item-by-scale correlations for each item and each scale. By default, missing data are replaced by the median for that variable.

> scores <- scoreItems(keys, bfi)
> scores

Call: scoreItems(keys = keys, items = bfi)

(Unstandardized) Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
alpha   0.7          0.72         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
ASE   0.014         0.014        0.013       0.011    0.017

Average item correlation:
          Agree Conscientious Extraversion Neuroticism Openness
average.r  0.32          0.34         0.39        0.46     0.23

Guttman 6* reliability:
         Agree Conscientious Extraversion Neuroticism Openness
Lambda.6   0.7          0.72         0.76        0.81      0.6

Signal/Noise based upon av.r:
             Agree Conscientious Extraversion Neuroticism Openness
Signal/Noise   2.3           2.6          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, alpha on the diagonal
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.70          0.36         0.63      -0.245     0.23
Conscientious  0.26          0.72         0.35      -0.305     0.30
Extraversion   0.46          0.26         0.76      -0.284     0.32
Neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
Openness       0.15          0.19         0.22      -0.086     0.60

In order to see the item by scale loadings and frequency counts of the data, print with the short option = FALSE.

To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores element of the scores object. (See Figure 16.)

6.4.2 Forming scales from a correlation matrix

There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.

Consider the same bfi data set, but first find the correlations and then use cluster.cor:

> r.bfi <- cor(bfi, use="pairwise")
> scales <- cluster.cor(keys, r.bfi)
> summary(scales)

Call: cluster.cor(keys = keys, r.mat = r.bfi)

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, (standardized) alpha on the diagonal
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.71          0.35         0.64      -0.242     0.25
Conscientious  0.25          0.73         0.36      -0.287     0.30
Extraversion   0.47          0.27         0.76      -0.278     0.35
Neuroticism   -0.18         -0.22        -0.22       0.815    -0.11
Openness       0.16          0.20         0.24      -0.074     0.61

To find the correlations of the items with each of the scales (the "structure" matrix) or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.
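A minimal sketch, assuming the keys and r.bfi objects created above:

> cl <- cluster.loadings(keys, r.bfi)  # structure and pattern matrices for the five scales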


> png('scores.png')
> pairs.panels(scores$scores, pch='.', jiggle=TRUE)
> dev.off()
quartz
     2

Figure 16: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.


6.5 Scoring Multiple Choice Items

Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice, where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors), and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).

> data(iqitems)
> iq.keys <- c(4,4,4,6, 6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys, iqitems)

Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

item statistics
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39

> # just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys, iqitems, score=FALSE)
> describe(iq.tf)  # compare to previous results

          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01

Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
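As a minimal sketch of that last step, the 16 true/false items might be grouped into four subscales following the item names (the scale names here are hypothetical, chosen only for illustration):

> iq.keys.list <- list(reasoning = 1:4, letters = 5:8, matrices = 9:12, rotation = 13:16)
> iq.scale.keys <- make.keys(16, iq.keys.list, item.labels = colnames(iq.tf))
> iq.scales <- scoreItems(iq.scale.keys, iq.tf)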

6.6 Item analysis

Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data, or based upon correlation matrices using the set.cor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.
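A minimal sketch of this workflow, using the bfi data set and its first five (Agreeableness) items:

> describe(bfi[1:25])                 # basic descriptive statistics
> fa.parallel(bfi[1:25])              # how many dimensions to extract?
> alpha(bfi[1:5], check.keys = TRUE)  # item whole correlations for one scale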

Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait, or just the total score. This may be done by using irt.responses (Figure 17).

7 Item Response Theory analysis

The use of Item Response Theory has become what is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 18).


> data(iqitems)
> iq.keys <- c(4,4,4,6, 6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys, iqitems, score=TRUE, short=FALSE)
> # note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  # set this to see the output for multiple items
> irt.responses(scores$scores, iqitems[1:4], breaks=11)


Figure 17: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.


7.1 Factor analysis and Item Response Theory

If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δj, and item discrimination, αj, may be found from factor analysis of the tetrachoric correlations, where λj is just the factor loading on the first factor and τj is the normal threshold reported by the tetrachoric function:

    δj = Dτj/√(1 − λ²j),    αj = λj/√(1 − λ²j)    (2)

where D is a scaling factor used when converting to the parameterization of the logistic model, and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λj) and item thresholds (τj) are just

    λj = αj/√(1 + α²j),    τj = δj/√(1 + α²j).
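As a worked example of equation (2), assume a single item with factor loading λ = 0.6 and tetrachoric threshold τ = 0.5; in the normal metric (D = 1):

> lambda <- 0.6
> tau <- 0.5
> alpha.j <- lambda / sqrt(1 - lambda^2)  # discrimination: 0.75
> delta.j <- tau / sqrt(1 - lambda^2)     # difficulty: 0.625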

Consider 9 dichotomous items representing one factor but differing in their levels of difficulty.

> set.seed(17)
> d9 <- sim.irt(9, 1000, -2.5, 2.5, mod="normal")  # dichotomous items
> test <- irt.fa(d9$items)

> test

Item Response Analysis using Factor Analysis

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor =  1
                -3    -2    -1     0     1     2     3
V1            0.48  0.59  0.24  0.06  0.01  0.00  0.00
V2            0.30  0.68  0.45  0.12  0.02  0.00  0.00
V3            0.12  0.50  0.74  0.29  0.06  0.01  0.00
V4            0.05  0.26  0.74  0.55  0.14  0.03  0.00
V5            0.01  0.07  0.44  1.05  0.41  0.06  0.01
V6            0.00  0.03  0.15  0.60  0.74  0.24  0.04
V7            0.00  0.01  0.04  0.22  0.73  0.63  0.16
V8            0.00  0.00  0.02  0.12  0.45  0.69  0.31
V9            0.00  0.01  0.02  0.08  0.25  0.47  0.36
Test Info     0.98  2.14  2.85  3.09  2.81  2.14  0.89
SEM           1.01  0.68  0.59  0.57  0.60  0.68  1.06
Reliability  -0.02  0.53  0.65  0.68  0.64  0.53 -0.12

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient
The degrees of freedom for the model is 27  and the objective function was  1.2
The number of observations was  1000  with Chi Square =  1195.58  with prob <  7.3e-235

The root mean square of the residuals (RMSA) is  0.09
The df corrected root mean square of the residuals is  0.1

Tucker Lewis Index of factoring reliability =  0.699
RMSEA index =  0.209  and the 90 % confidence intervals are  0.198 0.218
BIC =  1009.07

Similar analyses can be done for polytomous items, such as those of the bfi extraversion scale:

> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor =  1
                -3    -2    -1     0     1     2     3
E1            0.37  0.58  0.75  0.76  0.61  0.39  0.21
E2            0.48  0.89  1.18  1.24  0.99  0.59  0.25
E3            0.34  0.49  0.58  0.57  0.49  0.36  0.23
E4            0.63  0.98  1.11  0.96  0.67  0.37  0.16
E5            0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info     2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM           0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability   0.53  0.70  0.76  0.75  0.68  0.49  0.02

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient
The degrees of freedom for the model is 5  and the objective function was  0.05
The number of observations was  2800  with Chi Square =  135.92  with prob <  1.3e-27

The root mean square of the residuals (RMSA) is  0.04
The df corrected root mean square of the residuals is  0.06

Tucker Lewis Index of factoring reliability =  0.932
RMSEA index =  0.097  and the 90 % confidence intervals are  0.083 0.111
BIC =  96.23

The item information functions show that not all of the items are equally good (Figure 19):

These procedures can be generalized to more than one factor by specifying the number of


> op <- par(mfrow=c(3,1))
> plot(test, type="ICC")
> plot(test, type="IIC")
> plot(test, type="test")
> op <- par(mfrow=c(1,1))


Figure 18: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.


> e.info <- plot(e.irt, type="IIC")


Figure 19: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.


factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.

> print(e.info, sort=TRUE)

Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor =  1
                -3    -2    -1     0     1     2     3
E2            0.48  0.89  1.18  1.24  0.99  0.59  0.25
E4            0.63  0.98  1.11  0.96  0.67  0.37  0.16
E1            0.37  0.58  0.75  0.76  0.61  0.39  0.21
E3            0.34  0.49  0.58  0.57  0.49  0.36  0.23
E5            0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info     2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM           0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability   0.53  0.70  0.76  0.75  0.68  0.49  0.02

More extensive IRT packages include ltm and eRm; these should be used for serious Item Response Theory analysis.

7.2 Speeding up analyses

Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.
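A minimal sketch of this reuse pattern, anticipating the iq.tf items scored above; irt.fa saves the tetrachoric matrix as the rho element of its output:

> iq.irt <- irt.fa(iq.tf)     # slow: the tetrachoric correlations are found here
> om <- omega(iq.irt$rho, 4)  # fast: reuses the saved correlation matrix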

For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 20) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items.

> iq.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = iq.tf)
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor =  1
                -3    -2    -1     0     1     2     3
reason.4      0.04  0.20  0.59  0.59  0.19  0.04  0.01
reason.16     0.07  0.23  0.46  0.39  0.16  0.05  0.01
reason.17     0.06  0.27  0.69  0.50  0.14  0.03  0.01
reason.19     0.05  0.17  0.41  0.46  0.22  0.07  0.02


> iq.irt <- irt.fa(iq.tf)


Figure 20: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.


letter.7      0.04  0.16  0.44  0.52  0.23  0.06  0.02
letter.33     0.04  0.14  0.34  0.43  0.24  0.08  0.02
letter.34     0.04  0.17  0.48  0.54  0.22  0.06  0.01
letter.58     0.02  0.08  0.28  0.55  0.39  0.13  0.03
matrix.45     0.05  0.11  0.23  0.29  0.21  0.10  0.04
matrix.46     0.05  0.12  0.25  0.30  0.21  0.10  0.04
matrix.47     0.05  0.16  0.37  0.41  0.21  0.07  0.02
matrix.55     0.04  0.08  0.14  0.20  0.19  0.13  0.06
rotate.3      0.00  0.01  0.07  0.29  0.66  0.45  0.13
rotate.4      0.00  0.01  0.05  0.30  0.91  0.55  0.11
rotate.6      0.01  0.03  0.14  0.49  0.68  0.27  0.06
rotate.8      0.00  0.02  0.08  0.27  0.54  0.39  0.13
Test Info     0.55  1.95  4.99  6.53  5.41  2.57  0.71
SEM           1.34  0.72  0.45  0.39  0.43  0.62  1.18
Reliability  -0.80  0.49  0.80  0.85  0.82  0.61 -0.40

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient
The degrees of freedom for the model is 104  and the objective function was  1.85
The number of observations was  1525  with Chi Square =  2801.05  with prob <  0

The root mean square of the residuals (RMSA) is  0.08
The df corrected root mean square of the residuals is  0.09

Tucker Lewis Index of factoring reliability =  0.742
RMSEA index =  0.131  and the 90 % confidence intervals are  0.126 0.135
BIC =  2038.76

7.3 IRT based scoring

The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for score.irt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items per person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The score.irt function applies basic IRT to this problem.

Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects:

> v9 <- sim.irt(9, 1000, -2., 2., mod="normal")  # dichotomous items
> items <- v9$items
> test <- irt.fa(items)


> om <- omega(iq.irt$rho, 4)


Figure 21: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when examining the item response functions.


> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> # now delete some of the data - note that they are ordered by score
> items[1:333, 5:9] <- NA
> items[334:666, 3:7] <- NA
> items[667:1000, 1:4] <- NA
> scores <- score.irt(test, items)
> unitweighted <- score.irt(items=items, keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord], scores, unitweighted)
> colnames(scores.df) <- c("True theta", "irt theta", "total", "fit", "rasch", "total", "fit")

These results are seen in Figure 22.

8 Multilevel modeling

Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (over all descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.

8.1 Decomposing data into within and between level correlations using statsBy

There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.

This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:

    rxy = ηxwg ∗ ηywg ∗ rxywg + ηxbg ∗ ηybg ∗ rxybg    (3)


Figure 22: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.
> pairs.panels(scores.df, pch='.', gap=0)
> title("Comparing true theta for IRT, Rasch and classically based scoring", line=3)



where rxy is the normal correlation, which may be decomposed into a within group and a between group correlation, rxywg and rxybg, and η (eta) is the correlation of the data with the within group values or the group means.

8.2 Generating and displaying multilevel data

withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, and -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5, and V8, and V3, V6, and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6, and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5, and V6, and V7, V8, and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.

sim.multilevel will generate simulated data with a multilevel structure.

The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.

Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)) or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
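A minimal sketch of the first of these (the grouping variable name "education" is assumed from the sat.act data set):

> sb <- statsBy(sat.act, group = "education")
> sb$rwg  # pooled within group correlations
> sb$rbg  # correlations of the group means (between groups)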

9 Set Correlation and Multiple Regression from the correlation matrix

An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the set.cor function. Set correlation is

    R² = 1 − ∏(1 − λi)    (the product taken over the n eigenvalues)


where λi is the ith eigenvalue of the eigenvalue decomposition of the matrix

    R = Rxx⁻¹ Rxy Ryy⁻¹ Ryx.

Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic, based upon the average canonical correlation, might be more appropriate.

set.cor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.

Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using set.cor. Two things to notice: set.cor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example, the analysis is done on the correlation matrix rather than the raw data.

> C <- cov(sat.act, use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)

Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared:  0.0272,  Adjusted R-squared:  0.02301
F-statistic: 6.487 on 3 and 696 DF,  p-value: 0.0002476

Compare this with the output from set.cor.

> # compare with mat.regress
> set.cor(c(4:6), c(1:3), C, n.obs=700)


Call: setCor(y = y, x = x, data = data, z = z, n.obs = n.obs, use = use,
    std = std, square = square, main = main, plot = plot)

Multiple Regression from matrix input

Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19

multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11

Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05

degrees of freedom of regression
[1]   3 696

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.050 0.033 0.008
Chisq of canonical correlations
[1] 35.8 23.1  5.6

 Average squared canonical correlation =  0.03
 Cohen's Set Correlation R2 =  0.09
Shrunken Set Correlation R2 =  0.08
F and df of Cohen's Set Correlation  7.26  9  1681.86
Unweighted correlation between the two sets =  0.01

Note that the set.cor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R² is the same independent of the direction of the relationship.

10 Simulation functions

It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.

A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex and sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated Rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.

sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.minor Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.


sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure (see the sketch after this list).

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.
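A minimal sketch of one of these, assuming default arguments: sim.hierarchical with no arguments returns the population correlation matrix of a 9 variable bifactor structure (a Jensen and Weng (1994) style example), which can then be recovered with omega:

> R <- sim.hierarchical()  # default hierarchical (bifactor) correlation matrix
> om.sim <- omega(R)       # does omega recover the known general factor?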

Some of these functions are described in more detail in the companion vignette, psych for sem.

The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.

Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart (e.g., with r = .8, variables two steps apart correlate .64). Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.

Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.
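A hedged sketch of this challenge (the argument values here are assumed for illustration, not from the text): simulate 3 major factors plus minor factors, then ask parallel analysis how many factors it sees:

> set.seed(17)
> sm <- sim.minor(nvar = 12, nfact = 3, n = 500)  # major factors plus correlated residuals
> fa.parallel(sm$observed)                        # do the minor factors inflate the count?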

Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.


The four IRT simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.

The logistic model is

    P(x|θi, δj, γj, ζj) = γj + (ζj − γj)/(1 + e^(αj(δj − θi)))    (4)

where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), αj is item discrimination, and δj is item difficulty. For the 1 Parameter Logistic (Rasch) model, gamma = 0, zeta = 1, alpha = 1, and item difficulty is the only free parameter to specify.
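As a worked example of equation (4) for a Rasch item (γ = 0, ζ = 1, α = 1) with difficulty δ = 0.5, the probability of endorsement for a person at θ = 1 is:

> theta <- 1
> delta <- 0.5
> p <- 1 / (1 + exp(delta - theta))  # about 0.62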

(Graphics of these may be seen in the demonstrations for the logistic function.)

The normal model (sim.npn) calculates the probability using pnorm instead of the logistic function used in sim.npl, but the meaning of the parameters is otherwise the same. With the a = α parameter = 1.702 in the logistic model, the two models are practically identical.
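A minimal sketch of that near-equivalence:

> theta <- seq(-3, 3, 0.1)
> p.normal <- pnorm(theta)                     # normal ogive model
> p.logistic <- 1 / (1 + exp(-1.702 * theta))  # logistic model with D = 1.702
> round(max(abs(p.normal - p.logistic)), 3)    # the maximum difference is about 0.01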

11 Graphical Displays

Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as by specific diagram functions, e.g. (but not shown):

f3 <- fa(Thurstone, 3)
plot(f3)
fa.diagram(f3)
c <- iclust(Thurstone)
plot(c)            # a pretty boring plot
iclust.diagram(c)  # a better diagram
c3 <- iclust(Thurstone, 3)
plot(c3)           # a more interesting plot
data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)
ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)

The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in the dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.


> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA, xlim=xlim, ylim=ylim, main="Demonstration of dia functions", axes=FALSE, xlab="", ylab="")
> ul <- dia.rect(1, 9, labels="upper left", xlim=xlim, ylim=ylim)
> ll <- dia.rect(1, 3, labels="lower left", xlim=xlim, ylim=ylim)
> lr <- dia.ellipse(9, 3, "lower right", xlim=xlim, ylim=ylim)
> ur <- dia.ellipse(9, 9, "upper right", xlim=xlim, ylim=ylim)
> ml <- dia.ellipse(3, 6, "middle left", xlim=xlim, ylim=ylim)
> mr <- dia.ellipse(7, 6, "middle right", xlim=xlim, ylim=ylim)
> bl <- dia.ellipse(1, 1, "bottom left", xlim=xlim, ylim=ylim)
> br <- dia.rect(9, 1, "bottom right", xlim=xlim, ylim=ylim)
> dia.arrow(from=lr, to=ul, labels="right to left")
> dia.arrow(from=ul, to=ur, labels="left to right")
> dia.curved.arrow(from=lr, to=ll$right, labels="right to left")
> dia.curved.arrow(to=ur, from=ul$right, labels="left to right")
> dia.curve(ll$top, ul$bottom, "double")  # for rectangles, specify where to point
> dia.curved.arrow(mr, ur, "up")          # but for ellipses, just point to it
> dia.curve(ml, mr, "across")
> dia.arrow(ur, lr, "top down")
> dia.curved.arrow(br$top, lr$bottom, "up")
> dia.curved.arrow(bl, br, "left to right")
> dia.arrow(bl, ll$bottom)
> dia.curved.arrow(ml, ll$right)
> dia.curved.arrow(mr, lr$top)


Figure 23: The basic graphic capabilities of the dia functions are shown in this figure.

12 Miscellaneous functions

A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.

block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score.

geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headtail. Combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r, and estimates the effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also set.cor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.


superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys, or when forming example problems.

13 Data sets

A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941), and were reanalyzed by Bechtoldt (1961). Personality item data represent five personality factors on 25 items (bfi) or 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.

Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.


iq 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton height. peas is the data set Francis Galton used to introduce the correlation coefficient, with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1994), this data set allows for examples of basic scaling techniques.

14 Development version and a user's guide

The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered ones). To download that version, go to the repository http://personality-project.org/r/src/contrib and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.

News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news:

> news(Version > "1.2.8", package="psych")


15 Psychometric Theory

The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book): An introduction to Psychometric Theory with Applications in R (Revelle, prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular, and R in general, consult http://personality-project.org/r/r.guide.html, A short guide to R.

16 SessionInfo

This document was prepared using the following settings

> sessionInfo()

R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] psych_1.6.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4           lattice_0.20-33       matrixcalc_1.0-3      MASS_7.3-45
 [5] grid_3.3.0            arm_1.8-6             nlme_3.1-127          stats4_3.3.0
 [9] coda_0.18-1           minqa_1.2.4           mi_1.0                nloptr_1.0.4
[13] Matrix_1.2-6          boot_1.3-18           splines_3.3.0         lme4_1.1-12
[17] sem_3.1-7             tools_3.3.0           abind_1.4-3           GPArotation_2014.11-1
[21] parallel_3.3.0        mnormt_1.5-4


References

Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405-432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439-458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447-473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel Modeling in R (2.3): A Brief Introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245-276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78-98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297-334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173-178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England. 122 pp.

Fox, J., Nie, Z., and Byrnes, J. (2013). sem: Structural Equation Models. R package version 3.1-3.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430-450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.

83

Guttman L (1945) A basis for analyzing test-retest reliability Psychometrika 10(4)255ndash282

Hartigan J A (1975) Clustering Algorithms John Wiley amp Sons Inc New York NYUSA

Henry D B Tolan P H and Gorman-Smith D (2005) Cluster analysis in familypsychology research Journal of Family Psychology 19(1)121ndash132

Holzinger K and Swineford F (1937) The bi-factor method Psychometrika 2(1)41ndash54

Horn J (1965) A rationale and test for the number of factors in factor analysis Psy-chometrika 30(2)179ndash185

Horn J L and Engstrom R (1979) Cattellrsquos scree test in relation to Bartlettrsquos chi-squaretest and other observations on the number of factors problem Multivariate BehavioralResearch 14(3)283ndash300

Jennrich R and Bentler P (2011) Exploratory bi-factor analysis Psychometrika76(4)537ndash549

Jensen A R and Weng L-J (1994) What is a good g Intelligence 18(3)231ndash258

Loevinger J Gleser G and DuBois P (1953) Maximizing the discriminating power ofa multiple-score test Psychometrika 18(4)309ndash317

MacCallum R C Browne M W and Cai L (2007) Factor analysis models as ap-proximations In Cudeck R and MacCallum R C editors Factor analysis at 100Historical developments and future directions pages 153ndash175 Lawrence Erlbaum Asso-ciates Publishers Mahwah NJ

Martinent G and Ferrand C (2007) A cluster analysis of precompetitive anxiety Re-lationship with perfectionism and trait anxiety Personality and Individual Differences43(7)1676ndash1686

McDonald R P (1999) Test theory A unified treatment L Erlbaum Associates MahwahNJ

Mun E Y von Eye A Bates M E and Vaschillo E G (2008) Finding groupsusing model-based cluster analysis Heterogeneous emotional self-regulatory processesand heavy alcohol use risk Developmental Psychology 44(2)481ndash495

Nunnally J C (1967) Psychometric theory McGraw-Hill New York

Nunnally J C and Bernstein I H (1994) Psychometric theory McGraw-Hill New Yorkrdquo

3rd edition

84

Pedhazur E (1997) Multiple regression in behavioral research explanation and predictionHarcourt Brace College Publishers

R Core Team (2017) R A Language and Environment for Statistical Computing RFoundation for Statistical Computing Vienna Austria

Revelle W (1979) Hierarchical cluster-analysis and the internal structure of tests Mul-tivariate Behavioral Research 14(1)57ndash74

Revelle W (2017) psych Procedures for Personality and Psychological Research North-western University Evanston httpcranr-projectorgwebpackagespsych R pack-age version 173

Revelle W (in prep) An introduction to psychometric theory with applications in RSpringer

Revelle W Condon D and Wilt J (2011) Methodological advances in differentialpsychology In Chamorro-Premuzic T Furnham A and von Stumm S editorsHandbook of Individual Differences chapter 2 pages 39ndash73 Wiley-Blackwell

Revelle W and Rocklin T (1979) Very Simple Structure - alternative procedure forestimating the optimal number of interpretable factors Multivariate Behavioral Research14(4)403ndash414

Revelle W Wilt J and Rosenthal A (2010) Individual differences in cognition Newmethods for examining the personality-cognition link In Gruszka A Matthews Gand Szymura B editors Handbook of Individual Differences in Cognition AttentionMemory and Executive Control chapter 2 pages 27ndash49 Springer New York NY

Revelle W and Zinbarg R E (2009) Coefficients alpha beta omega and the glbcomments on Sijtsma Psychometrika 74(1)145ndash154

Schmid J J and Leiman J M (1957) The development of hierarchical factor solutionsPsychometrika 22(1)83ndash90

Shrout P E and Fleiss J L (1979) Intraclass correlations Uses in assessing raterreliability Psychological Bulletin 86(2)420ndash428

Sneath P H A and Sokal R R (1973) Numerical taxonomy the principles and practiceof numerical classification A Series of books in biology W H Freeman San Francisco

Sokal R R and Sneath P H A (1963) Principles of numerical taxonomy A Series ofbooks in biology W H Freeman San Francisco

Spearman C (1904) The proof and measurement of association between two things TheAmerican Journal of Psychology 15(1)72ndash101

Thorburn W M (1918) The myth of Occamrsquos razor Mind 27345ndash353

85

Thurstone L L and Thurstone T G (1941) Factorial studies of intelligence TheUniversity of Chicago press Chicago Ill

Tryon R C (1935) A theory of psychological componentsndashan alternative to rdquomathematicalfactorsrdquo Psychological Review 42(5)425ndash454

Tryon R C (1939) Cluster analysis Edwards Brothers Ann Arbor Michigan

Velicer W (1976) Determining the number of components from the matrix of partialcorrelations Psychometrika 41(3)321ndash327

Zinbarg R E Revelle W Yovel I and Li W (2005) Cronbachrsquos α Revellersquos β andMcDonaldrsquos ωH Their relations with each other and two alternative conceptualizationsof reliability Psychometrika 70(1)123ndash133

Zinbarg R E Yovel I Revelle W and McDonald R P (2006) Estimating gener-alizability to a latent variable common to all of a scalersquos indicators A comparison ofestimators for ωh Applied Psychological Measurement 30(2)121ndash144

86


> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> cor.plot(r.circ,main='24 variables in a circumplex')
> dev.off()
null device
          1

Figure 3: Using the cor.plot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again toward the corners of the matrix. Circumplex structures are common in the study of affect.


4.5 Polychoric, tetrachoric, polyserial, and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, the φ will underestimate the value of the Pearson applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process. A simple generalization of this to the case of multiple cuts is the polychoric correlation.

Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlation.

If the data are a mix of continuous, polytomous and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.

The correlation matrix resulting from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will also happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigen values of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
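A minimal sketch of these functions, using the bfi and burt data sets supplied with psych (the median split for the tetrachoric example is purely illustrative):

> r.poly <- polychoric(bfi[1:5])   #returns $rho (correlations) and $tau (thresholds)
> d <- apply(bfi[1:5], 2, function(x) as.numeric(x > median(x, na.rm = TRUE)))
> r.tet <- tetrachoric(d)   #tetrachoric correlations require dichotomous input
> r.smooth <- cor.smooth(burt)   #smooth a non-positive-definite matrix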

5 Item and scale analysis

The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.

5.1 Dimension reduction through factor analysis and cluster analysis

Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity¹. The goal for parsimony is seen in psychometrics as an attempt either to describe (components) or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

¹Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.

The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs². At the most simple, a set of items can be thought to represent a random sample from one underlying domain, or perhaps a small set of domains. The question for the psychometrician is how many domains are represented and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.

²Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.

At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components, or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores, or estimate (for factors) factor scores.

Several of the functions in psych address the problem of data reduction; a sketch of typical calls appears after this list.

fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis, and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will be eventually phased out.

principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values.

factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the sum of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.

vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ2 estimate.

fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.

fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

nfactors A number of different tests for the number of factors problem are run.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.
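As a sketch of typical calls for these functions, using the Thurstone correlation matrix supplied with psych (the argument values are illustrative):

> f3 <- fa(Thurstone, nfactors = 3, n.obs = 213)   #minres factor analysis
> p3 <- principal(Thurstone, nfactors = 3)   #principal components
> factor.congruence(f3, p3)   #cosines between the two sets of loadings
> vss(Thurstone, n.obs = 213)   #Very Simple Structure and the MAP criterion
> fa.parallel(Thurstone, n.obs = 213)   #parallel analysis
> ic <- iclust(Thurstone)   #hierarchical item cluster analysis
> fa.diagram(f3)   #path diagram of the factor solution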

5.1.1 Minimum Residual Factor Analysis

The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix $_nR_n$ be approximated by the product of a factor matrix $_nF_k$ and its transpose plus a diagonal matrix of uniquenesses?

$$R = FF' + U^2 \qquad (1)$$

The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls", and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.

factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.

A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 1). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", and "geominQ", and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation, although prior versions defaulted to varimax. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).

5.1.2 Principal Axis Factor Analysis

An alternative least squares algorithm, factor.pa (incorporated into fa as an option, fm = "pa"), does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter parameter to 1 produces the SAS solution.
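A sketch of the two calls (max.iter = 1 mimics the single iteration SAS solution described above):

> f3.pa <- fa(Thurstone, nfactors = 3, n.obs = 213, fm = "pa")
> f3.sas <- fa(Thurstone, nfactors = 3, n.obs = 213, fm = "pa", max.iter = 1)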

The solutions from the fa, the factor.minres, and factor.pa, as well as the principal functions, can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations are varimax and quartimax. Oblique transformations include oblimin, quartimin, and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.

Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 2).

5.1.3 Weighted Least Squares Factor Analysis

Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller overall residuals.


Table 1: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.

> f3t <- fa(Thurstone,3,n.obs=213)
> f3t
Factor Analysis using method = minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  MR1   MR2   MR3   h2   u2 com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.0
First.Letters    0.00  0.86  0.00 0.73 0.27 1.0
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

 With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90 % confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63


Table 2: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.

> f3 <- fa(Thurstone,3,n.obs = 213,fm="pa")
> f3o <- target.rot(f3)
> f3o
Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix
                  PA1   PA2   PA3   h2   u2
Sentences        0.89 -0.03  0.07 0.81 0.19
Vocabulary       0.89  0.07  0.00 0.80 0.20
Sent.Completion  0.83  0.04  0.03 0.70 0.30
First.Letters   -0.02  0.85 -0.01 0.73 0.27
4.Letter.Words  -0.05  0.74  0.09 0.57 0.43
Suffixes         0.17  0.63 -0.09 0.43 0.57
Letter.Series   -0.06 -0.08  0.84 0.69 0.31
Pedigrees        0.33 -0.09  0.48 0.37 0.63
Letter.Group    -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00


In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 3).

The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 4). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 4) or by a structure diagram (Figure 5).

> plot(f3t)

[Figure body omitted: a 3 x 3 SPLOM of the MR1, MR2, and MR3 loadings, each axis running from 0.0 to 0.8, with the panel title "Factor Analysis"]

Figure 4: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.

A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but


Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.

> f3w <- fa(Thurstone,3,n.obs = 213,fm="wls")
> print(f3w,cut=0,digits=3)
Factor Analysis using method = wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix
                   WLS1   WLS2   WLS3    h2    u2  com
Sentences         0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary        0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion   0.833  0.034  0.007 0.735 0.265 1.00
First.Letters    -0.002  0.855  0.003 0.731 0.269 1.00
4.Letter.Words   -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes          0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series     0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees         0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group     -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

 With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90 % confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627


> fa.diagram(f3t)

[Figure body omitted: a path diagram titled "Factor Analysis" in which MR1 is marked by Sentences (0.9), Vocabulary (0.9), and Sent.Completion (0.8); MR2 by First.Letters (0.9), 4.Letter.Words (0.7), and Suffixes (0.6); and MR3 by Letter.Series (0.8), Letter.Group (0.6), and Pedigrees (0.5); the factor intercorrelations are 0.6, 0.5, and 0.5]

Figure 5: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.


that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.

5.1.4 Principal Components analysis (PCA)

An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Psychologists typically use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 4 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. (Factor loadings and PCA loadings converge as the number of items increases: factor loadings may be thought of as the asymptotic principal component loadings as the number of items on the component grows towards infinity.) Also note how the goodness of fit statistics based upon the residual off diagonal elements are much worse than the fa solution.
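One way to see both the similarity and the inflation of the component loadings is to put the two sets of loadings side by side; a sketch (the object names are illustrative):

> f3 <- fa(Thurstone, nfactors = 3, n.obs = 213)   #common factor model
> p3 <- principal(Thurstone, nfactors = 3, n.obs = 213)   #component model
> round(cbind(unclass(f3$loadings), unclass(p3$loadings)), 2)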

5.1.5 Hierarchical and bi-factor solutions

For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).

Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 5) can themselves be factored. This results in a higher order factor model (Figure 6). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 7). Yet


Table 4: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 2. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.

> p3p <- principal(Thurstone,3,n.obs = 213,rotate="Promax")
> p3p
Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  RC1   RC2   RC3   h2   u2 com
Sentences        0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary       0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion  0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters    0.01  0.84  0.07 0.78 0.22 1.0
4.Letter.Words  -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes         0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series    0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees        0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group    -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

 With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
 with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98


another solution is to use structural equation modeling to directly solve for the general and group factors.

> om.h <- omega(Thurstone,n.obs=213,sl=FALSE)
> op <- par(mfrow=c(1,1))

[Figure body omitted: the hierarchical (higher order) solution, titled "Omega": the nine Thurstone variables load on F1, F2, and F3, which in turn load 0.8, 0.8, and 0.7 on g]

Figure 6: A higher order factor solution to the Thurstone 9 variable problem.

Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011). Unfortunately, this rotation will frequently achieve a local rather than global optimum, and it is recommended to try multiple original configurations.
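A sketch of both routes; the use of ten random starts is illustrative, and Random.Start comes from the GPArotation package:

> f.bi <- fa(Thurstone, nfactors = 3, n.obs = 213, rotate = "bifactor")
> f0 <- fa(Thurstone, nfactors = 3, n.obs = 213, rotate = "none")
> library(GPArotation)
> tries <- lapply(1:10, function(i) bifactorT(unclass(f0$loadings), Tmat = Random.Start(3)))   #keep the best-fitting start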


> om <- omega(Thurstone,n.obs=213)

[Figure body omitted: the bifactor solution, titled "Omega": each of the nine variables loads directly on g (loadings of roughly 0.5 to 0.7) as well as on one of the residualized group factors F1*, F2*, and F3*]

Figure 7: A bifactor solution to the Thurstone 9 variable problem.


5.1.6 Item Cluster Analysis: iclust

An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) review why the iclust algorithm is particularly appropriate for scale construction in marketing.

Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known, somewhat pretentiously, as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust, and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):

1. Find the proximity (e.g., correlation) matrix.

2. Identify the most similar pair of items.

3. Combine this most similar pair of items to form a new variable (cluster).

4. Find the similarity of this cluster to all other items and clusters.

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains or, in iclust, if there is a failure to increase reliability coefficients α or β).

6. Purify the solution by reassigning items to the most similar cluster center.

iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 8).
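A sketch of the two ways of calling iclust just mentioned (letting the reliability criteria decide, or fixing the number of clusters a priori):

> ic <- iclust(bfi[1:25])   #stop when α or β fail to increase
> ic5 <- iclust(bfi[1:25], nclusters = 5)   #ask for exactly five clusters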


Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 7). The pattern matrix is available as an object in the results.

> data(bfi)
> ic <- iclust(bfi[1:25])

[Figure body omitted: ICLUST dendrogram of the 25 bfi items, with clusters labeled C1-C21 and each cluster annotated with its α and β; the three largest clusters are C20 (α = 0.81, β = 0.63), C16 (α = 0.81, β = 0.76), and C15 (α = 0.73, β = 0.67)]

Figure 8: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.

The previous analysis (Figure 8) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 9). Note that finding the polychoric correlations takes some time, but the next three analyses were done using that correlation matrix (r.poly$rho). When using the console for input, polychoric will report on its progress while working, using progressBar.


Table 5: The summary statistics from an iclust analysis show three large clusters and a smaller one.

> summary(ic)  #show the results
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6* :
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
       C20    C16    C15   C21
C20   0.80 -0.291   0.40 -0.33
C16  -0.24  0.815  -0.29  0.11
C15   0.30 -0.221   0.73 -0.30
C21  -0.23  0.074  -0.20  0.61


polychoric and tetrachoric, when running on OSX, will take advantage of the multiple cores on the machine and run somewhat faster.

Table 6: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.

> data(bfi)
> r.poly <- polychoric(bfi[1:25])  #the ... indicate the progress of the function

A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.

5.2 Confidence intervals using bootstrapping techniques

Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ2 and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 8). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.
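The same bootstrapping may be requested when factoring polychoric correlations; a sketch (20 iterations for speed only):

> fb <- fa(bfi[1:10], nfactors = 2, n.iter = 20)   #Pearson based
> fbp <- fa(bfi[1:10], nfactors = 2, n.iter = 20, cor = "poly")   #polychoric based, much slower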

5.3 Comparing factor/component/cluster solutions

Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of the congruence coefficient, which is just the cosine of the angle between the dimensions:

$$c_{f_i f_j} = \frac{\sum_{k=1}^{n} f_{ik} f_{jk}}{\sqrt{\sum f_{ik}^2 \sum f_{jk}^2}}$$

Consider the case of a four factor solution and a four cluster solution to the Big Five problem:

> f4 <- fa(bfi[1:25],4,fm="pa")
> factor.congruence(f4,ic)
      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12


> ic.poly <- iclust(r.poly$rho,title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)

[Figure body omitted: ICLUST dendrogram titled "ICLUST using polychoric correlations"; the final cluster C23 has α = 0.83 and β = 0.58]

Figure 9: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 8), which was done using Pearson correlations.


> ic.poly <- iclust(r.poly$rho,5,title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)

[Figure body omitted: ICLUST dendrogram titled "ICLUST using polychoric correlations for nclusters=5"]

Figure 10: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 9), which was done without specifying the number of clusters, and to the next one (Figure 11), which was done by changing the beta criterion.


> ic.poly <- iclust(r.poly$rho,beta.size=3,title="ICLUST beta.size=3")

[Figure body omitted: ICLUST dendrogram titled "ICLUST beta.size=3"]

Figure 11: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 8, 9, 10).


Table 7: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.

> print(ic,cut=.3)
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
    O   P   C20   C16   C15   C21
A1 C20 C20
A2 C20 C20  0.59
A3 C20 C20  0.65
A4 C20 C20  0.43
A5 C20 C20  0.65
C1 C15 C15              0.54
C2 C15 C15              0.62
C3 C15 C15              0.54
C4 C15 C15        0.31 -0.66
C5 C15 C15 -0.30  0.36 -0.59
E1 C20 C20 -0.50
E2 C20 C20 -0.61  0.34
E3 C20 C20  0.59             -0.39
E4 C20 C20  0.66
E5 C20 C20  0.50        0.40 -0.32
N1 C16 C16        0.76
N2 C16 C16        0.75
N3 C16 C16        0.74
N4 C16 C16 -0.34  0.62
N5 C16 C16        0.55
O1 C20 C21                   -0.53
O2 C21 C21                    0.44
O3 C20 C21  0.39             -0.62
O4 C21 C21                   -0.33
O5 C21 C21                    0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit = 0.68   Pattern fit = 0.96   RMSR = 0.05
NULL


Table 8: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.

> fa(bfi[1:10],2,n.iter=20)
Factor Analysis with confidence intervals using method = fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method = minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

 With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with MLE Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.078 and the 90 % confidence intervals are 0.071 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

 Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.46 -0.40 -0.36
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.06 -0.03  0.00  0.74  0.77  0.80
A4  0.10  0.15  0.18  0.41  0.44  0.48
A5 -0.02  0.02  0.07  0.58  0.62  0.66
C1  0.52  0.57  0.60 -0.08 -0.05 -0.02
C2  0.59  0.62  0.66 -0.04 -0.01  0.02
C3  0.50  0.54  0.57 -0.01  0.03  0.08
C4 -0.70 -0.66 -0.62 -0.02  0.01  0.03
C5 -0.62 -0.57 -0.53 -0.08 -0.05 -0.01

 Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.28     0.32  0.37
>


PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90

A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 9).

Table 9: Congruence coefficients for oblique factor, bifactor and component solutions for the Thurstone problem.

> factor.congruence(list(f3t,f3o,om,p3p))
      MR1  MR2  MR3   PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2  RC3
MR1  1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
MR2  0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
MR3  0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
PA1  1.00 0.03 0.01  1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2  0.06 1.00 0.01  0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99 0.05
PA3  0.13 0.06 1.00  0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01 0.99
g    0.72 0.60 0.52  0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58 0.50
F1   1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
F2   0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
F3   0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
h2   0.74 0.57 0.51  0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56 0.49
RC1  1.00 0.04 0.06  1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06 0.00
RC2  0.08 0.99 0.02  0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00 0.05
RC3  0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05 1.00

5.4 Determining the number of dimensions to extract

How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).

Techniques most commonly used include:

1) Extracting factors until the chi square of the residual matrix is not significant.

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis) fa.parallel (Horn, 1965).

4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).

5) Extracting factors as long as they are interpretable.

6) Using the Very Simple Structure Criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.

Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples (N > 1000) all of the eigen values of random factors will be very close to 1. The scree test is quite appealing but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigator's creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.
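A sketch of consulting several of these criteria at once for the bfi items:

> fa.parallel(bfi[1:25])   #real versus random eigen values (criterion 3)
> vss(bfi[1:25])   #VSS and the MAP criterion (criteria 6 and 7)
> nf <- nfactors(bfi[1:25], n = 8)   #a battery of indices for 1 to 8 factors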

An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest but that there are also numerous minor factors of no substantive interest, but that account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.

5.4.1 Very Simple Structure

The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Absolute Partial correlation) of Velicer (1976).

Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 12). However, the MAP criterion suggests that 5 is optimal.


> vss <- vss(bfi[1:25],title="Very Simple Structure of a Big 5 inventory")

[Figure body omitted: a plot titled "Very Simple Structure of a Big 5 inventory", showing Very Simple Structure Fit (0.0 to 1.0) against the Number of Factors (1 to 8), with separate lines for complexity 1, 2, 3, and 4]

Figure 12: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.


> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximum of 0.58 with 4 factors
VSS complexity 2 achieves a maximum of 0.74 with 4 factors

The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit RMSEA  BIC   SABIC complex eChisq  SRMR eCRMS  eBIC
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.123 9648 10522.1     1.0  23881 0.119 0.125 21698
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.100 5287  6084.5     1.2  12432 0.086 0.094 10440
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.087 3200  3924.3     1.3   7232 0.066 0.075  5422
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.074 1731  2385.1     1.4   3750 0.047 0.057  2115
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.055  281   869.3     1.6   1495 0.030 0.038    27
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.043 -296   228.5     1.7    670 0.020 0.027  -639
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.037 -463     1.2     1.9    448 0.016 0.023  -711
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.032 -524  -117.6     1.9    289 0.013 0.020  -727

5.4.2 Parallel Analysis

An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 13). It is interesting to compare fa.parallel with the paran from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial minor (but not zero) factors, show that using smcs will tend to lead to too many factors.

A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly or fa.parallel with the cor option="poly". By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
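A sketch of the polychoric version (slow for many items; 20 replications as just described):

> fa.parallel(bfi[1:25], cor = "poly", n.iter = 20)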

5.5 Factor extension

Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created


> fa.parallel(bfi[1:25],main="Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors = 6 and the number of components = 6

[Figure body omitted: a scree-style plot titled "Parallel Analysis of a Big 5 inventory", showing eigen values of principal components and factor analysis (0 to 5) against the factor/component number (1 to 25), with lines for PC and FA solutions from the actual, simulated, and resampled data]

Figure 13: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.


to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 14).

Another way to examine the overlap between two sets is the use of set correlation found by setCor (discussed later).

6 Classical Test Theory and Reliability

Surprisingly, 112 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and over estimates the first factor saturation (Revelle and Zinbarg, 2009).

α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

$$\lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(V_x)}{V_x}\right) = \frac{n}{n-1}\,\frac{V_x - tr(V_x)}{V_x} = \alpha$$

Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. For τ equivalent items, alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy"), coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.

Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, $e_j^2$, and is

$$\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum (1 - r_{smc}^2)}{V_x}$$

The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.

G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability


> v16 <- sim.item(16)
> s <- c(1,3,5,7,9,11,13,15)
> f2 <- fa(v16[,s],2)
> fe <- fa.extension(cor(v16)[s,-s],f2)
> fa.diagram(f2,fe=fe)

[Figure 14 diagram: "Factor analysis and extension" — factors MR1 and MR2, defined by the eight odd-numbered variables (loadings about ±0.5 to ±0.7), extended to the eight even-numbered variables.]

Figure 14: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.


the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.

Alpha, G6 and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.

More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.

Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.

Functions for examining the reliability of a single scale or a set of scales include:

alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item-whole correlations, α if an item is omitted, and item means and standard deviations.

guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zegers (1978) (μ0 . . . μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).

omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)

cluster.cor Given a n x c cluster definition matrix of -1s, 0s, and 1s (the keys), and a n x n correlation matrix, find the correlations of the composite clusters.

scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of items scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha, coefficient G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.

score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.

6.1 Reliability of a single scale

A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, McDonald's ωh and ωt. Consider a simulated data set, representing 9 items with a hierarchical structure, and the following correlation matrix. Then using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as the α if one item is dropped at a time.

> set.seed(17)
> r9 <- sim.hierarchical(n=500,raw=TRUE)$observed
> round(cor(r9),2)

     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00

> alpha(r9)

Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
0.77 0.8 0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97

Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that item 2 is to be scored (incorrectly) negatively.

> keys <- c(1,-1,1,1,1,1,1)
> alpha(attitude,keys)

Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
       0.43      0.52    0.75      0.14 1.1 0.13   58 5.5

 lower alpha upper     95% confidence boundaries
0.18 0.43 0.68

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r  S/N alpha se
rating           0.32      0.44    0.67     0.114 0.77    0.170
complaints-      0.80      0.80    0.82     0.394 3.89    0.057
privileges       0.27      0.41    0.72     0.103 0.69    0.176
learning         0.14      0.31    0.64     0.069 0.44    0.207
raises           0.14      0.27    0.61     0.059 0.38    0.205
critical         0.36      0.47    0.76     0.130 0.90    0.139
advance          0.21      0.34    0.66     0.079 0.51    0.171

 Item statistics
             n raw.r std.r r.cor r.drop mean   sd
rating      30  0.60  0.60  0.60   0.33   65 12.2
complaints- 30 -0.57 -0.58 -0.74  -0.74   50 13.3
privileges  30  0.66  0.65  0.54   0.42   53 12.2
learning    30  0.80  0.79  0.78   0.64   56 11.7
raises      30  0.82  0.83  0.85   0.69   65 10.4
critical    30  0.50  0.53  0.35   0.27   75  9.9
advance     30  0.74  0.75  0.71   0.58   43 10.3

Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if item 2 is dropped then the reliability is improved substantially. This suggests that item 2 was incorrectly scored. Doing the analysis again with item 2 positively scored produces much more favorable results.

> keys <- c(1,1,1,1,1,1,1)
> alpha(attitude,keys)

Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
0.76 0.84 0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3

It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.

> items <- sim.congeneric(N=500,short=FALSE,low=-2,high=2,categorical=TRUE) #500 responses to 4 discrete items
> alpha(items$observed)  #item response analysis of congeneric measures

Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
0.68 0.72 0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02

Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0

6.2 Using omega to find the reliability of a single scale

Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (See Figure 15) or omegaSem for a confirmatory analysis using the sem package (Fox et al., 2013) based upon the exploratory solution from omega.

McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009) http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf ).

One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.

ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal). For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively, and some negatively. Although it is probably better to specify which directions the items are to be scored by specifying a key vector, if flip=TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)

β, an alternative to ω, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).

The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).

Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.

There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be the √rab where rab is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. 2007. To do this in omega, add the option="first" or option="second" to the call.
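A sketch of these calls, using the simulated r9 data from above (the fm and option values are those described in the text):

> omega(r9, fm="pa")                     #principal axes extraction of the subfactors
> omega(r9, nfactors=2)                  #two group factors, equated general loadings
> omega(r9, nfactors=2, option="first")  #equate g with the first group factor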

Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.

In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.

McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness u2 from factor analysis to find e2j. This is based on a decomposition of the variance of a test score, Vx, into four parts: that due to a general factor, g, that due to a set of group factors, f, (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)

Letting x = cg + Af + Ds + e, then the communality of item j, based upon general as well as group factors, is $h_j^2 = c_j^2 + \sum f_{ij}^2$ and the unique variance for the item, $u_j^2 = \sigma_j^2(1 - h_j^2)$, may be used to estimate the test reliability. That is, if $h_j^2$ is the communality of item j, based upon general as well as group factors, then for standardized items, $e_j^2 = 1 - h_j^2$ and

$$\omega_t = \frac{\mathbf{1}cc'\mathbf{1} + \mathbf{1}AA'\mathbf{1}'}{V_x} = 1 - \frac{\sum(1-h_j^2)}{V_x} = 1 - \frac{\sum u^2}{V_x}$$


Because $h_j^2 \geq r_{smc}^2$, $\omega_t \geq \lambda_6$.

It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor:

$$\omega_h = \frac{\mathbf{1}cc'\mathbf{1}}{V_x}$$

Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

$$\omega_{\inf} = \frac{\mathbf{1}cc'\mathbf{1}}{\mathbf{1}cc'\mathbf{1} + \mathbf{1}AA'\mathbf{1}'}$$

In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.

> om.9

9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than  0.2
     g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03


> om.9 <- omega(r9,title="9 simulated variables")

[Figure 15 diagram: "9 simulated variables" — a bifactor solution with the 9 variables loading on a general factor g (loadings 0.3 to 0.7) and on three group factors F1*, F2*, F3*.]

Figure 15: A bifactor solution for 9 simulated variables with a hierarchical structure.


RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

6.3 Estimating ωh using Confirmatory Factor Analysis

The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.

> omegaSem(r9,n.obs=500)

Call: omegaSem(m = r9, n.obs = 500)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than  0.2
     g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03
RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

 Omega Hierarchical from a confirmatory model using sem =  0.72
 Omega Total from a confirmatory model using sem =  0.83
With loadings of
      g  F1*  F2*  F3*   h2   u2
V1 0.74 0.40           0.71 0.29
V2 0.58 0.36           0.47 0.53
V3 0.56 0.42           0.50 0.50
V4 0.57      0.45      0.53 0.47
V5 0.58      0.25      0.40 0.60
V6 0.43      0.26      0.26 0.74
V7 0.49           0.38 0.38 0.62
V8 0.44           0.45 0.39 0.61
V9 0.34           0.24 0.17 0.83

With eigenvalues of:
   g  F1*  F2*  F3*
2.60 0.47 0.40 0.33


6.3.1 Other estimates of reliability

Other estimates of reliability are found by the splitHalf function. These are described in more detail in Revelle and Zinbarg (2009). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.

> splitHalf(r9)

Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) =  0.84
Guttman lambda 6                          =  0.8
Average split half reliability            =  0.79
Guttman lambda 3 (alpha)                  =  0.8
Minimum split half reliability  (beta)    =  0.72

6.4 Reliability and correlations of multiple scales within an inventory

A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness, and Openness). The data may either be the raw data or a correlation matrix (scoreItems), or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6*.

6.4.1 Scoring from raw data

To score these five scales from the 25 items, use the scoreItems function with the helper function make.keys. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.

There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged


to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.

The printed output includes coefficients α and G6*, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.

First, create a keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copying it into R.) Although not normally necessary, show the keys to understand what is happening.

Note that the number of items to specify in the make.keys function is the total number of items in the inventory. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.

> keys <- make.keys(nvars=28,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+      Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+      Openness = c(21,-22,23,24,-25)),
+      item.labels=colnames(bfi))
> keys

          Agree Conscientious Extraversion Neuroticism Openness
A1           -1             0            0           0        0
A2            1             0            0           0        0
A3            1             0            0           0        0
A4            1             0            0           0        0
A5            1             0            0           0        0
C1            0             1            0           0        0
C2            0             1            0           0        0
C3            0             1            0           0        0
C4            0            -1            0           0        0
C5            0            -1            0           0        0
E1            0             0           -1           0        0
E2            0             0           -1           0        0
E3            0             0            1           0        0
E4            0             0            1           0        0
E5            0             0            1           0        0
N1            0             0            0           1        0
N2            0             0            0           1        0
N3            0             0            0           1        0
N4            0             0            0           1        0
N5            0             0            0           1        0
O1            0             0            0           0        1
O2            0             0            0           0       -1
O3            0             0            0           0        1
O4            0             0            0           0        1
O5            0             0            0           0       -1
gender        0             0            0           0        0
education     0             0            0           0        0
age           0             0            0           0        0

The use of multiple key matrices for different inventories is facilitated by using the superMatrix function to combine two or more matrices. This allows convenient scoring of large data sets combining multiple inventories with keys based upon each individual inventory. Pretend for the moment that the big 5 items were made up of two inventories, one consisting of the first 10 items, the second the last 18 items (15 personality items + 3 demographic items). Then the following code would work:

> keys.1 <- make.keys(10,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10)))
> keys.2 <- make.keys(15,list(Extraversion=c(-1,-2,3:5),Neuroticism=c(6:10),
+      Openness = c(11,-12,13,14,-15)))
> keys.25 <- superMatrix(list(keys.1,keys.2))

The resulting keys matrix is identical to that found above, except that it does not include the extra 3 demographic items. This is useful when scoring the raw items, because the response frequencies for each category are reported, including those for the demographic data.

This use of making multiple key matrices and then combining them into one super matrix of keys is particularly useful when combining demographic information with items to be scored. A set of demographic keys can be made, and then these can be combined with the keys for the particular scales.

Now use these keys in combination with the raw data to score the items, calculate basic reliability and intercorrelations, and find the item-by-scale correlations for each item and each scale. By default, missing data are replaced by the median for that variable.

> scores <- scoreItems(keys,bfi)
> scores

Call: scoreItems(keys = keys, items = bfi)

(Unstandardized) Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
alpha   0.7          0.72         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
    Agree Conscientious Extraversion Neuroticism Openness
ASE 0.014         0.014        0.013       0.011    0.017

Average item correlation:
          Agree Conscientious Extraversion Neuroticism Openness
average.r  0.32          0.34         0.39        0.46     0.23

Guttman 6* reliability:
         Agree Conscientious Extraversion Neuroticism Openness
Lambda.6   0.7          0.72         0.76        0.81      0.6

Signal/Noise based upon av.r :
             Agree Conscientious Extraversion Neuroticism Openness
Signal/Noise   2.3           2.6          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, alpha on the diagonal
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.70          0.36         0.63      -0.245     0.23
Conscientious  0.26          0.72         0.35      -0.305     0.30
Extraversion   0.46          0.26         0.76      -0.284     0.32
Neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
Openness       0.15          0.19         0.22      -0.086     0.60

In order to see the item by scale loadings and frequency counts of the data, print with the short option = FALSE.

To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores. (See Figure 16.)

6.4.2 Forming scales from a correlation matrix

There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.

Consider the same bfi data set, but first find the correlations, and then use cluster.cor.

> r.bfi <- cor(bfi,use="pairwise")
> scales <- cluster.cor(keys,r.bfi)
> summary(scales)

Call: cluster.cor(keys = keys, r.mat = r.bfi)

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, (standardized) alpha on the diagonal
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.71          0.35         0.64      -0.242     0.25
Conscientious  0.25          0.73         0.36      -0.287     0.30
Extraversion   0.47          0.27         0.76      -0.278     0.35
Neuroticism   -0.18         -0.22        -0.22       0.815    -0.11
Openness       0.16          0.20         0.24      -0.074     0.61

To find the correlations of the items with each of the scales (the "structure" matrix) or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.
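A sketch, continuing with the keys and correlation matrix from above; this assumes (as is the case for cluster.loadings output) that the structure matrix is returned in the loadings element:

> cl <- cluster.loadings(keys, r.bfi)   #structure and pattern matrices
> round(cl$loadings[1:5,], 2)           #item by scale correlations for the first 5 items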


> png('scores.png')
> pairs.panels(scores$scores,pch='.',jiggle=TRUE)
> dev.off()
quartz
     2

Figure 16: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.


6.5 Scoring Multiple Choice Items

Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).

> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys,iqitems)

Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

item statistics
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39

> #just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys,iqitems,score=FALSE)
> describe(iq.tf)  #compare to previous results

          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01

Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function, as sketched below.
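A sketch of this two step process; the split of the 16 items into "reasoning" and "spatial" subscales is purely illustrative, not a recommendation:

> iq.tf <- score.multiple.choice(iq.keys, iqitems, score=FALSE)   #true/false
> iq.sub.keys <- make.keys(16, list(reasoning=1:8, spatial=9:16),
+      item.labels=colnames(iq.tf))     #hypothetical subscales
> iq.sub.scores <- scoreItems(iq.sub.keys, iq.tf)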

6.6 Item analysis

Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or based upon correlation matrices using the setcor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches; a sketch of the basic sequence follows.
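The sequence just described might look like the following sketch (the choice of 5 factors and the Agreeableness keying are for illustration only):

> describe(bfi[1:25])                       #basic descriptive statistics
> fa.parallel(bfi[1:25])                    #how many dimensions?
> f5 <- fa(bfi[1:25], 5)                    #factor the 25 items
> A <- alpha(bfi[1:5], keys=c(-1,1,1,1,1))  #item-whole statistics for one scale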

Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 17).

7 Item Response Theory analysis

The use of Item Response Theory has become so common that it is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 18).


> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys,iqitems,score=TRUE,short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  #set this to see the output for multiple items
> irt.responses(scores$scores,iqitems[1:4],breaks=11)

[Figure 17 plots: four panels (reason.4, reason.16, reason.17, reason.19) showing P(response) for each response alternative (0-6) as a function of theta.]

Figure 17: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.


7.1 Factor analysis and Item Response Theory

If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are of course just the factor loadings. Item difficulty, δj, and item discrimination, αj, may be found from factor analysis of the tetrachoric correlations, where λj is just the factor loading on the first factor and τj is the normal threshold reported by the tetrachoric function.

$$\delta_j = \frac{D\tau}{\sqrt{1-\lambda_j^2}}, \qquad \alpha_j = \frac{\lambda_j}{\sqrt{1-\lambda_j^2}} \eqno(2)$$

where D is a scaling factor used when converting to the parameterization of the logistic model, and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λj) and item thresholds (τ) are just

$$\lambda_j = \frac{\alpha_j}{\sqrt{1+\alpha_j^2}}, \qquad \tau_j = \frac{\delta_j}{\sqrt{1+\alpha_j^2}}.$$
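A minimal sketch of equation 2 as an R function (the name irt.pars is hypothetical; irt.fa performs this conversion internally):

> irt.pars <- function(lambda, tau, D=1.702) {
+    alpha <- lambda/sqrt(1 - lambda^2)     #discrimination (equation 2)
+    delta <- D * tau/sqrt(1 - lambda^2)    #difficulty (equation 2)
+    list(discrimination=alpha, difficulty=delta)
+ }
> irt.pars(lambda=.6, tau=.5)   #a single item with loading .6 and threshold .5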

Consider 9 dichotomous items representing one factor but differing in their levels of difficulty.

> set.seed(17)
> d9 <- sim.irt(9,1000,-2.5,2.5,mod="normal") #dichotomous items
> test <- irt.fa(d9$items)

> test

Item Response Analysis using Factor Analysis

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor =  1
               -3    -2    -1     0    1    2     3
V1           0.48  0.59  0.24  0.06 0.01 0.00  0.00
V2           0.30  0.68  0.45  0.12 0.02 0.00  0.00
V3           0.12  0.50  0.74  0.29 0.06 0.01  0.00
V4           0.05  0.26  0.74  0.55 0.14 0.03  0.00
V5           0.01  0.07  0.44  1.05 0.41 0.06  0.01
V6           0.00  0.03  0.15  0.60 0.74 0.24  0.04
V7           0.00  0.01  0.04  0.22 0.73 0.63  0.16
V8           0.00  0.00  0.02  0.12 0.45 0.69  0.31
V9           0.00  0.01  0.02  0.08 0.25 0.47  0.36
Test Info    0.98  2.14  2.85  3.09 2.81 2.14  0.89
SEM          1.01  0.68  0.59  0.57 0.60 0.68  1.06
Reliability -0.02  0.53  0.65  0.68 0.64 0.53 -0.12

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27  and the objective function was  1.2
The number of observations was  1000  with Chi Square =  1195.58  with prob <  7.3e-235

The root mean square of the residuals (RMSA) is  0.09
The df corrected root mean square of the residuals is  0.1

Tucker Lewis Index of factoring reliability =  0.699
RMSEA index =  0.209  and the 90 % confidence intervals are  0.198 0.218
BIC =  1009.07

Similar analyses can be done for polytomous items such as those of the bfi extraversion scale:

> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor =  1
              -3   -2   -1    0    1    2    3
E1          0.37 0.58 0.75 0.76 0.61 0.39 0.21
E2          0.48 0.89 1.18 1.24 0.99 0.59 0.25
E3          0.34 0.49 0.58 0.57 0.49 0.36 0.23
E4          0.63 0.98 1.11 0.96 0.67 0.37 0.16
E5          0.33 0.42 0.47 0.45 0.37 0.27 0.17
Test Info   2.15 3.36 4.09 3.98 3.13 1.97 1.02
SEM         0.68 0.55 0.49 0.50 0.56 0.71 0.99
Reliability 0.53 0.70 0.76 0.75 0.68 0.49 0.02

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5  and the objective function was  0.05
The number of observations was  2800  with Chi Square =  135.92  with prob <  1.3e-27

The root mean square of the residuals (RMSA) is  0.04
The df corrected root mean square of the residuals is  0.06

Tucker Lewis Index of factoring reliability =  0.932
RMSEA index =  0.097  and the 90 % confidence intervals are  0.083 0.111
BIC =  96.23

The item information functions show that not all of the items are equally good (Figure 19):

These procedures can be generalized to more than one factor by specifying the number of


> op <- par(mfrow=c(3,1))
> plot(test,type="ICC")
> plot(test,type="IIC")
> plot(test,type="test")
> op <- par(mfrow=c(1,1))

[Figure 18 plots: three panels — "Item parameters from factor analysis" (probability of response), "Item information from factor analysis", and "Test information -- item parameters from factor analysis" — each plotted against the latent trait (normal scale) for items V1-V9.]

Figure 18: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.


> e.info <- plot(e.irt,type="IIC")

[Figure 19 plot: "Item information from factor analysis for factor 1" — information curves for -E1, -E2, E3, E4, and E5 against the latent trait (normal scale).]

Figure 19: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.


factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.

> print(e.info,sort=TRUE)

Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor =  1
              -3   -2   -1    0    1    2    3
E2          0.48 0.89 1.18 1.24 0.99 0.59 0.25
E4          0.63 0.98 1.11 0.96 0.67 0.37 0.16
E1          0.37 0.58 0.75 0.76 0.61 0.39 0.21
E3          0.34 0.49 0.58 0.57 0.49 0.36 0.23
E5          0.33 0.42 0.47 0.45 0.37 0.27 0.17
Test Info   2.15 3.36 4.09 3.98 3.13 1.97 1.02
SEM         0.68 0.55 0.49 0.50 0.56 0.71 0.99
Reliability 0.53 0.70 0.76 0.75 0.68 0.49 0.02

More extensive IRT packages include ltm and eRm, and these should be used for serious Item Response Theory analysis.

7.2 Speeding up analyses

Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.

For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 21) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items. In outline, the reuse looks like the sketch below.
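The slow tetrachoric step happens once, in the first call; the second call reuses the saved matrix (these are the same calls used for Figures 20 and 21):

> iq.irt <- irt.fa(iq.tf)        #slow: finds the tetrachoric correlations
> om <- omega(iq.irt$rho, 4)     #fast: reuses the saved correlation matrix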

> iq.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = iq.tf)
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor =  1
             -3   -2   -1    0    1    2    3
reason.4   0.04 0.20 0.59 0.59 0.19 0.04 0.01
reason.16  0.07 0.23 0.46 0.39 0.16 0.05 0.01
reason.17  0.06 0.27 0.69 0.50 0.14 0.03 0.01
reason.19  0.05 0.17 0.41 0.46 0.22 0.07 0.02


> iq.irt <- irt.fa(iq.tf)

[Figure 20 plot: "Item information from factor analysis" — information curves for the 16 ability items (reason, letter, matrix, and rotate items) against the latent trait (normal scale).]

Figure 20: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.


letter.7     0.04 0.16 0.44 0.52 0.23 0.06  0.02
letter.33    0.04 0.14 0.34 0.43 0.24 0.08  0.02
letter.34    0.04 0.17 0.48 0.54 0.22 0.06  0.01
letter.58    0.02 0.08 0.28 0.55 0.39 0.13  0.03
matrix.45    0.05 0.11 0.23 0.29 0.21 0.10  0.04
matrix.46    0.05 0.12 0.25 0.30 0.21 0.10  0.04
matrix.47    0.05 0.16 0.37 0.41 0.21 0.07  0.02
matrix.55    0.04 0.08 0.14 0.20 0.19 0.13  0.06
rotate.3     0.00 0.01 0.07 0.29 0.66 0.45  0.13
rotate.4     0.00 0.01 0.05 0.30 0.91 0.55  0.11
rotate.6     0.01 0.03 0.14 0.49 0.68 0.27  0.06
rotate.8     0.00 0.02 0.08 0.27 0.54 0.39  0.13
Test Info    0.55 1.95 4.99 6.53 5.41 2.57  0.71
SEM          1.34 0.72 0.45 0.39 0.43 0.62  1.18
Reliability -0.80 0.49 0.80 0.85 0.82 0.61 -0.40

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104  and the objective function was  1.85
The number of observations was  1525  with Chi Square =  2801.05  with prob <  0

The root mean square of the residuals (RMSA) is  0.08
The df corrected root mean square of the residuals is  0.09

Tucker Lewis Index of factoring reliability =  0.742
RMSEA index =  0.131  and the 90 % confidence intervals are  0.126 0.135
BIC =  2038.76

7.3 IRT based scoring

The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for score.irt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items/person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The score.irt function applies basic IRT to this problem.

Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects:

> v9 <- sim.irt(9,1000,-2.,2.,mod="normal") #dichotomous items
> items <- v9$items
> test <- irt.fa(items)


> om <- omega(iq.irt$rho,4)

[Figure 21 diagram: "Omega" — the 16 ability items load on a general factor g (loadings 0.4 to 0.7) as well as on four group factors F1*-F4*.]

Figure 21: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when we examine the item response functions.


> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> #now delete some of the data - note that they are ordered by score
> items[1:333,5:9] <- NA
> items[334:666,3:7] <- NA
> items[667:1000,1:4] <- NA
> scores <- score.irt(test,items)
> unitweighted <- score.irt(items=items,keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord],scores,unitweighted)
> colnames(scores.df) <- c("True theta","irt theta","total","fit","rasch","total","fit")

These results are seen in Figure 22

8 Multilevel modeling

Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (overall descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.

8.1 Decomposing data into within and between level correlations using statsBy

There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.

This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:

$$r_{xy} = \eta_{x_{wg}} \cdot \eta_{y_{wg}} \cdot r_{xy_{wg}} + \eta_{x_{bg}} \cdot \eta_{y_{bg}} \cdot r_{xy_{bg}} \eqno(3)$$


Figure 22: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.
> pairs.panels(scores.df,pch='.',gap=0)
> title("Comparing true theta for IRT, Rasch and classically based scoring",line=3)

[Figure 22 plot: a pairs.panels scatter plot matrix titled "Comparing true theta for IRT, Rasch and classically based scoring", showing the relations among True theta, irt theta, total, fit, rasch, total, and fit (e.g., True theta correlates 0.78 with irt theta and 0.40 with the total score).]


where rxy is the normal correlation, which may be decomposed into a within group and between group correlations, rxywg and rxybg, and η (eta) is the correlation of the data with the within group values or the group means.

8.2 Generating and displaying multilevel data

withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5 and V8, and V3, V6 and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6 and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5 and V6, and V7, V8 and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.

sim.multilevel will generate simulated data with a multilevel structure.

The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.

Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)), or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
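A sketch of the first of these, pulling out the within and between group correlations from the statsBy output:

> sb <- statsBy(sat.act, "education")
> round(sb$rwg, 2)    #pooled within group correlations
> round(sb$rbg, 2)    #correlations of the group means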

9 Set Correlation and Multiple Regression from the correlation matrix

An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setcor function. Set correlation is

$$R^2 = 1 - \prod_{i=1}^{n}(1 - \lambda_i)$$


where λi is the ith eigen value of the eigen value decomposition of the matrix

$$R = R_{xx}^{-1}R_{xy}R_{yy}^{-1}R_{yx}$$

Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic, based upon the average canonical correlation, might be more appropriate.

setcor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.

Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setcor. Two things to notice: setcor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example, the analysis is done on the correlation matrix rather than the raw data.

> C <- cov(sat.act,use="pairwise")
> model1 <- lm(ACT~ gender + education + age, data=sat.act)
> summary(model1)

Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared:  0.0272,  Adjusted R-squared:  0.02301
F-statistic: 6.487 on 3 and 696 DF,  p-value: 0.0002476

Compare this with the output from setcor

> #compare with mat.regress
> setcor(c(4:6),c(1:3),C, n.obs=700)

Call: setCor(y = y, x = x, data = data, z = z, n.obs = n.obs, use = use,
    std = std, square = square, main = main, plot = plot)

Multiple Regression from matrix input

 Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

 Multiple R
 ACT SATV SATQ
0.16 0.10 0.19

 multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

 Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11

 Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

 SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

 t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

 Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

 Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

 Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

 F
 ACT SATV SATQ
6.49 2.26 8.63

 Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05

 degrees of freedom of regression
[1]   3 696

 Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.050 0.033 0.008

Chisq of canonical correlations
[1] 35.8 23.1  5.6

 Average squared canonical correlation =  0.03
 Cohens Set Correlation R2 =  0.09
 Shrunken Set Correlation R2 =  0.08
 F and df of Cohens Set Correlation  7.26 9 1681.86
Unweighted correlation between the two sets =  0.01

Note that the setcor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same independent of the direction of the relationship.

10 Simulation functions

It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.

A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex, and sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.

sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.simple Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.

sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.

Some of these functions are described in more detail in the companion vignette psych forsem

The default values for simstructure is to generate a 4 factor 12 variable data set witha simplex structure between the factors

Two data structures that are particular challenges to exploratory factor analysis are thesimplex structure and the presence of minor factors Simplex structures simsimplex willtypically occur in developmental or learning contexts and have a correlation structure of rbetween adjacent variables and rn for variables n apart Although just one latent variable(r) needs to be estimated the structure will have nvar-1 factors

Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.
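For example (a hedged sketch: the nvar, nfact, and n arguments follow the sim.minor documentation, and the simulated scores are assumed to be returned in the observed element of the result, as for the other sim functions):

> set.seed(42)
> minor <- sim.minor(nvar = 12, nfact = 3, n = 500)  # 3 major and several minor factors
> fa.parallel(minor$observed)   # do the minor factors inflate the suggested number?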

Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.


The four IRT simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.

The logistic model is

$$P(x|\theta_i, \delta_j, \gamma_j, \zeta_j) = \gamma_j + \frac{\zeta_j - \gamma_j}{1 + e^{\alpha_j(\delta_j - \theta_i)}} \qquad (4)$$

where γ_j is the lower asymptote or guessing parameter, ζ_j is the upper asymptote (normally 1), α_j is item discrimination and δ_j is item difficulty. For the 1 Parameter Logistic (Rasch) model, γ = 0, ζ = 1, α = 1, and item difficulty is the only free parameter to specify.

(Graphics of these may be seen in the demonstrations for the logistic function)

The normal model (sim.npn) calculates the probability using pnorm instead of the logistic function used in sim.npl, but the meaning of the parameters is otherwise the same. With the a = α parameter set to 1.702 in the logistic model, the two models are practically identical.
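Both points are easy to check numerically (a hedged sketch; sim.irt's arguments follow its documentation, and the simulated responses are assumed to be in the items element of the result):

> theta <- seq(-3, 3, 0.1)
> max(abs(1/(1 + exp(-1.702 * theta)) - pnorm(theta)))   # less than .01
> set.seed(17)
> sim.data <- sim.irt(nvar = 9, n = 1000, low = -2, high = 2, mod = "logistic")
> irt.out <- irt.fa(sim.data$items)   # try to recover the item parameters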

11 Graphical Displays

Many of the functions in the psych package include graphic output and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as specific diagram functions, e.g. (but not shown):

f3 <- fa(Thurstone,3)
plot(f3)
fa.diagram(f3)
c <- iclust(Thurstone)
plot(c)             # a pretty boring plot
iclust.diagram(c)   # a better diagram
c3 <- iclust(Thurstone,3)
plot(c3)            # a more interesting plot
data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)
ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)

The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in the dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.


> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA,xlim=xlim,ylim=ylim,main="Demonstration of dia functions",axes=FALSE,xlab="",ylab="")
> ul <- dia.rect(1,9,labels="upper left",xlim=xlim,ylim=ylim)
> ll <- dia.rect(1,3,labels="lower left",xlim=xlim,ylim=ylim)
> lr <- dia.ellipse(9,3,"lower right",xlim=xlim,ylim=ylim)
> ur <- dia.ellipse(9,9,"upper right",xlim=xlim,ylim=ylim)
> ml <- dia.ellipse(3,6,"middle left",xlim=xlim,ylim=ylim)
> mr <- dia.ellipse(7,6,"middle right",xlim=xlim,ylim=ylim)
> bl <- dia.ellipse(1,1,"bottom left",xlim=xlim,ylim=ylim)
> br <- dia.rect(9,1,"bottom right",xlim=xlim,ylim=ylim)
> dia.arrow(from=lr,to=ul,labels="right to left")
> dia.arrow(from=ul,to=ur,labels="left to right")
> dia.curved.arrow(from=lr,to=ll$right,labels="right to left")
> dia.curved.arrow(to=ur,from=ul$right,labels="left to right")
> dia.curve(ll$top,ul$bottom,"double")   # for rectangles, specify where to point
> dia.curved.arrow(mr,ur,"up")           # but for ellipses, just point to it
> dia.curve(ml,mr,"across")
> dia.arrow(ur,lr,"top down")
> dia.curved.arrow(br$top,lr$bottom,"up")
> dia.curved.arrow(bl,br,"left to right")
> dia.arrow(bl,ll$bottom)
> dia.curved.arrow(ml,ll$right)
> dia.curved.arrow(mr,lr$top)

[Figure: output of the dia demonstration code above, showing labelled rectangles and ellipses connected by straight and curved arrows.]

Figure 23: The basic graphic capabilities of the dia functions are shown in this figure.

12 Miscellaneous functions

A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.

block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score.

geometric.mean, also harmonic.mean, find the appropriate mean for working with different kinds of data (see the short examples following this list).

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headtail: combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r, and estimates effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also setCor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.


superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys or when forming example problems (see the short examples following this list).
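Two of these helpers lend themselves to quick illustration (a minimal sketch with toy values):

> x <- c(1, 2, 4, 8)
> geometric.mean(x)   # 2.83: appropriate for ratio or growth data
> harmonic.mean(x)    # 2.13: appropriate for averaging rates
>
> A <- matrix(1, 2, 2)
> B <- matrix(2, 3, 3)
> superMatrix(A, B)   # 5 x 5 matrix: A upper left, B lower right, 0s elsewhere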

13 Data sets

A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941) and were reanalyzed by Bechtoldt (1961). Personality item data represent five personality factors on 25 items (bfi) or 13 personality inventory scores (epi.bfi), and there are 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.

Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses (see the short example following this list).

sat.act Self reported scores on the SAT Verbal, SAT Quantitative, and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.


iqitems 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton heights. peas is the data set Francis Galton used to introduce the correlation coefficient, with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1994), this data set allows for examples of basic scaling techniques.
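As a quick illustration of working with the bfi data set mentioned above (a sketch using functions described earlier in this vignette):

> data(bfi)
> dim(bfi)            # 2800 rows; 25 items plus gender, education, and age
> describe(bfi[1:5])  # descriptives for the first five (Agreeableness) items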

14 Development version and a user's guide

The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered, ones). To download that version, go to the repository http://personality-project.org/r/src/contrib and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.

News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news:

> news(Version > "1.2.8", package="psych")


15 Psychometric Theory

The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book): An introduction to Psychometric Theory with Applications in R (Revelle, in prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular, and R in general, consult http://personality-project.org/r/r.guide.html: A short guide to R.

16 SessionInfo

This document was prepared using the following settings:

> sessionInfo()

R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] psych_1.6.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4           lattice_0.20-33  matrixcalc_1.0-3  MASS_7.3-45   grid_3.3.0     arm_1.8-6
 [7] nlme_3.1-127          stats4_3.3.0     coda_0.18-1       minqa_1.2.4   mi_1.0         nloptr_1.0.4
[13] Matrix_1.2-6          boot_1.3-18      splines_3.3.0     lme4_1.1-12   sem_3.1-7      tools_3.3.0
[19] abind_1.4-3           GPArotation_2014.11-1  parallel_3.3.0  mnormt_1.5-4


References

Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405–432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439–458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447–473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel Modeling in R (2.3): A Brief Introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245–276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78–98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297–334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173–178.

Everitt, B. (1974). Cluster analysis (122 pp.). John Wiley & Sons, Oxford, England.

Fox, J., Nie, Z., and Byrnes, J. (2013). sem: Structural Equation Models. R package version 3.1-3.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430–450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.


Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255–282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121–132.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41–54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283–300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4):537–549.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231–258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309–317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153–175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676–1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481–495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill, New York, 3rd edition.


Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57–74.

Revelle, W. (2017). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. http://cran.r-project.org/web/packages/psych. R package version 1.7.3.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39–73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403–414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Individual differences in cognition: New methods for examining the personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27–49. Springer, New York, NY.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega, and the glb: comments on Sijtsma. Psychometrika, 74(1):145–154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83–90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. A Series of Books in Biology. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. A Series of Books in Biology. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345–353.


Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components – an alternative to "mathematical factors". Psychological Review, 42(5):425–454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321–327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123–133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121–144.
